Daily arXiv Papers - 2025-10-31

AI-enhanced summaries of 25 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] StreetMath: Study of LLMs’ Approximation Behaviors

Chiung-Yi Tseng, Somshubhra Roy, Maisha Thasin, Danyang Zhang, Blessing Effiong

Main category: cs.CL

TL;DR: The paper introduces StreetMath, a benchmark for evaluating LLMs’ approximation abilities in informal mathematical reasoning, revealing that models tend to compute exact values rather than approximate, and that they rely on largely separate neural components for exact and approximate operations.

Motivation: To address the gap in understanding LLMs' ability to perform approximate reasoning in informal, fast-paced mathematical operations, particularly for non-autoregressive decoder models.

Method: Introduced StreetMath benchmark and conducted extensive evaluations across different LLM architectures (Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, Mamba-GPT-3B) using mechanistic interpretability techniques to probe internal computational states.

Result: LLMs generally attempt to compute exact values or invoke external tools even in approximation tasks, consume more tokens when solving approximation tasks, and use largely separate neural components for exact vs approximate arithmetic operations.

Conclusion: LLMs do not exhibit cognitive miserliness like humans in street math settings, as they tend to over-compute rather than use efficient approximation strategies.

Abstract: There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on precise arithmetic operations in autoregressive architectures. However, their ability to perform approximate reasoning in informal, fast-paced mathematical operations has received far less attention, especially among non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models’ approximation abilities under real-world approximation scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values or invoke external tools even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic operations rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open-source our work at https://github.com/ctseng777/StreetMath.
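As a concrete illustration of how a street-math item might be scored (a sketch under our own assumptions, not the paper's released evaluation harness), one can accept any estimate inside a tolerance band and compare token budgets across the exact and approximate variants of a problem:

```python
import re

def approx_correct(answer_text: str, true_value: float, tol: float = 0.10) -> bool:
    """Score an answer as correct if its final number falls within +/-tol of
    the true value, mirroring a street-math setting where a rough estimate
    is good enough."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_text.replace(",", ""))
    if not numbers:
        return False
    estimate = float(numbers[-1])
    return abs(estimate - true_value) <= tol * abs(true_value)

def over_computation_ratio(exact_tokens: int, approx_tokens: int) -> float:
    """Tokens spent on the approximate variant of a problem relative to the
    exact variant; ratios above 1 reflect the over-computation the paper reports."""
    return approx_tokens / max(exact_tokens, 1)

# "Roughly how much is 48 * 21?" (true value 1008)
print(approx_correct("It's about 1000.", 1008))                    # True
print(over_computation_ratio(exact_tokens=42, approx_tokens=95))   # ~2.26
```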

[2] Review Based Entity Ranking using Fuzzy Logic Algorithmic Approach: Analysis

Pratik N. Kalamkar, Anupama G. Phakatkar

Main category: cs.CL

TL;DR: Proposes a fuzzy logic-based approach to rank entities by analyzing opinion strength and orientation in reviews, classifying sentiments into granular levels (very weak to very strong) using opinion words related to specific product aspects.

Motivation: Traditional lexicon-based opinion mining doesn't consider opinion strength gradations, limiting the ability to distinguish between different levels of sentiment intensity in reviews.

Method: Combines fuzzy logic to classify opinion words into strength categories and syntactic dependency resolution to identify relations between opinion words and specific aspect words of interest.

Result: Enables more precise entity ranking based on both sentiment orientation and strength for specific product aspects.

Conclusion: The proposed approach provides finer-grained sentiment analysis by considering opinion strength levels, improving entity ranking accuracy compared to traditional binary sentiment classification.

Abstract: Opinion mining, also called sentiment analysis, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. The holistic lexicon-based approach does not consider the strength of each opinion, i.e., whether the opinion is very strongly negative (or positive), strongly negative (or positive), moderately negative (or positive), weakly negative (or positive), or very weakly negative (or positive). In this paper, we propose an approach to rank entities based on the orientation and strength of the entity reviews and users’ queries by classifying them into granularity levels (i.e., very weak, weak, moderate, strong, and very strong) by combining opinion words (i.e., adverbs, adjectives, nouns, and verbs) that are related to an aspect of interest of a certain product. We use a fuzzy logic algorithmic approach to classify opinion words into the different categories and syntactic dependency resolution to find relations for the desired aspect words. Opinion words related to certain aspects of interest are considered to find the entity score for that aspect in the review.
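To make the fuzzy classification concrete, here is a toy sketch with triangular membership functions over an opinion-strength score; the breakpoints and the intensifier lexicon are illustrative assumptions, not values from the paper:

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b, zero outside [a, c]."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a) if b > a else 1.0
    return (c - x) / (c - b) if c > b else 1.0

# Assumed fuzzy sets over a strength score in [0, 1].
STRENGTH_SETS = {
    "very weak":   (0.0, 0.0, 0.25),
    "weak":        (0.0, 0.25, 0.5),
    "moderate":    (0.25, 0.5, 0.75),
    "strong":      (0.5, 0.75, 1.0),
    "very strong": (0.75, 1.0, 1.0),
}

def classify_strength(score: float) -> str:
    """Assign the fuzzy strength category with the highest membership."""
    return max(STRENGTH_SETS, key=lambda k: tri(score, *STRENGTH_SETS[k]))

# Toy lexicon: an adverb intensifier shifts the adjective's base strength.
base = {"good": 0.6, "bad": 0.6}
intensifier = {"very": +0.25, "slightly": -0.25}

score = min(1.0, base["good"] + intensifier["very"])  # "very good" -> 0.85
print(classify_strength(score))                        # strong
```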

[3] POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe

Main category: cs.CL

TL;DR: POWSM is the first unified framework that jointly performs multiple phonetic tasks including phone recognition, grapheme-to-phoneme conversion, phoneme-to-grapheme conversion, and automatic speech recognition, outperforming specialized models while enabling universal speech processing.

Motivation: Current phonetic tasks like ASR, phone recognition, G2P, and P2G are studied in isolation with task-specific architectures and datasets, lacking a unified approach despite their conceptual similarity.

Method: Introduces POWSM (Phonetic Open Whisper-style Speech Model), a unified framework capable of jointly performing multiple phone-related tasks and enabling seamless conversion between audio, text, and phones.

Result: POWSM outperforms or matches specialized phone recognition models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR.

Conclusion: The unified framework opens up new possibilities for universal and low-resource speech processing, with training data, code and models released to foster open science.

Abstract: Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.

[4] LASTIST: LArge-Scale Target-Independent STance dataset

DongJae Kim, Yaejin Lee, Minsu Park, Eunil Park

Main category: cs.CL

TL;DR: The paper introduces LASTIST, a large-scale Korean dataset for target-independent stance detection with 563,299 labeled sentences from political party press releases.

Motivation: Most stance detection research focuses on target-dependent tasks and English datasets, creating a gap for low-resource languages like Korean and target-independent approaches.

Method: Collected press releases from Korean political parties to create a dataset of 563,299 labeled sentences, then trained state-of-the-art deep learning models for stance detection.

Result: Developed the LASTIST dataset supporting target-independent stance detection and diachronic evolution analysis, making it publicly available online.

Conclusion: The LASTIST dataset fills an important research gap by providing Korean language resources for target-independent stance detection and enables various stance analysis tasks.

Abstract: Stance detection has emerged as an area of research in the field of artificial intelligence. However, most research is currently centered on the target-dependent stance detection task, which is based on a person’s stance in favor of or against a specific target. Furthermore, most benchmark datasets are based on English, making it difficult to develop models in low-resource languages such as Korean, especially for an emerging field such as stance detection. This study proposes the LArge-Scale Target-Independent STance (LASTIST) dataset to fill this research gap. Collected from the press releases of Korean political parties, the LASTIST dataset comprises 563,299 labeled Korean sentences. We provide a detailed description of how we collected and constructed the dataset and trained state-of-the-art deep learning and stance detection models. Our LASTIST dataset is designed for various tasks in stance detection, including target-independent stance detection and diachronic evolution stance detection. We deploy our dataset on https://anonymous.4open.science/r/LASTIST-3721/.

[5] zFLoRA: Zero-Latency Fused Low-Rank Adapters

Dhananjaya Gowda, Seoha Song, Harshith Goka, Junhyun Lee

Main category: cs.CL

TL;DR: zFLoRA is a zero-latency fused low-rank adapter that eliminates inference overhead while maintaining performance comparable to full fine-tuning and LoRA across various tasks.

Motivation: Current task-specific adapters in LLMs, despite having small parameter counts, cause significant inference latency (up to 2.5x the base model), creating deployment bottlenecks.

Method: Proposed zFLoRA - a fused low-rank adapter that integrates with base model parameters to avoid additional compute during inference while maintaining adapter benefits.

Result: zFLoRA achieves comparable performance to LoRA and full fine-tuning on 18 tasks across commonsense reasoning, math reasoning, and summary-dialogue, with zero to negligible latency overhead on both NPU and GPU platforms.

Conclusion: zFLoRA provides an effective solution for deploying multiple task-specific adapters without inference latency penalties, making it suitable for resource-constrained environments.

Abstract: Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with this apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant during inference (up to 2.5x that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B and 7B show that zFLoRA compares favorably against the popular supervised fine-tuning benchmarks including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three different categories, namely commonsense reasoning, math reasoning and summary-dialogue. Latency measurements made on NPU (Samsung Galaxy S25+) as well as GPU (NVIDIA H100) platforms show that the proposed zFLoRA adapters introduce zero to negligible latency overhead.
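The zero-latency intuition can be illustrated with the standard low-rank weight-folding identity W' = W + scale * BA. zFLoRA's actual fusion is architecture-level and more involved; this numpy sketch only shows why a folded adapter adds no extra matmuls at inference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                             # hidden size, adapter rank
W = rng.standard_normal((d, d))           # frozen base weight
A = rng.standard_normal((r, d)) * 0.01    # low-rank down-projection
B = rng.standard_normal((d, r)) * 0.01    # low-rank up-projection
scale = 1.0                               # LoRA scaling (alpha / r)

# Unfused inference: two extra matmuls per token -> adapter latency overhead.
x = rng.standard_normal((1, d))
y_unfused = x @ W.T + (x @ A.T) @ B.T * scale

# Fused inference: fold the adapter into the base weight once, offline.
W_fused = W + scale * (B @ A)
y_fused = x @ W_fused.T

print(np.allclose(y_unfused, y_fused))  # True: same output, no extra matmuls
```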

[6] BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection

Yaniv Nikankin, Dana Arad, Itay Itzhak, Anja Reusch, Adi Simhi, Gal Kesten-Pomeranz, Yonatan Belinkov

Main category: cs.CL

TL;DR: Three improvements for circuit discovery in mechanistic interpretability: bootstrapping for consistent attributions, ratio-based edge selection, and integer linear programming instead of greedy selection.

Motivation: Address challenges in circuit discovery by improving reliability and faithfulness of identified circuits in mechanistic interpretability.

Method: Use bootstrapping to identify consistent attribution edges, ratio-based selection for positive-scoring edges, and integer linear programming formulation for circuit selection.

Result: Methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models.

Conclusion: The proposed improvements enhance circuit discovery in mechanistic interpretability, achieving better performance and faithfulness.

Abstract: One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
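A rough sketch of the bootstrapping step, under the assumption that per-example attribution scores for candidate edges are available as a matrix (the ratio-based and ILP selection stages are only hinted at here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_edges, n_examples = 200, 64
# Hypothetical per-example attribution scores for each candidate edge.
scores = rng.standard_normal((n_edges, n_examples)) \
         + rng.standard_normal(n_edges)[:, None]

def bootstrap_consistent_edges(scores, n_boot=1000, alpha=0.05):
    """Keep edges whose bootstrap confidence interval over the example-mean
    attribution excludes zero, i.e., edges with consistent attributions."""
    n = scores.shape[1]
    idx = rng.integers(0, n, size=(n_boot, n))        # resampled example ids
    boot_means = scores[:, idx].mean(axis=2)          # (n_edges, n_boot)
    lo = np.percentile(boot_means, 100 * alpha / 2, axis=1)
    hi = np.percentile(boot_means, 100 * (1 - alpha / 2), axis=1)
    return (lo > 0) | (hi < 0)

consistent = bootstrap_consistent_edges(scores)
positive = scores.mean(axis=1) > 0
keep = consistent & positive        # prioritize strong positive-scoring edges
print(keep.sum(), "edges selected of", n_edges)
```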

[7] Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

Pedro Corrêa, João Lima, Victor Moreno, Lucas Ueda, Paula Dornhofer Paro Costa

Main category: cs.CL

TL;DR: SLMs rely more on text semantics than acoustic features for emotion recognition, especially when speech and text convey conflicting emotions.

Motivation: To evaluate if spoken language models truly integrate audio and text modalities or rely primarily on text representations for emotion recognition.

Method: Tested four SLMs on emotionally incongruent speech samples where semantic content and speech expressiveness convey different emotions.

Result: SLMs predominantly used textual semantics over acoustic features for emotion classification, showing text representations dominate.

Conclusion: Current SLMs have limited audio-text integration and rely heavily on text, highlighting need for better multimodal learning.

Abstract: Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models’ generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
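A tiny sketch of how reliance on each modality could be tallied on incongruent items; the field names and toy data below are ours, not the EMIS schema:

```python
# Each item pairs a semantic (text) emotion with a different vocal emotion.
items = [
    {"text_emotion": "joy",   "speech_emotion": "anger", "model_pred": "joy"},
    {"text_emotion": "sad",   "speech_emotion": "joy",   "model_pred": "sad"},
    {"text_emotion": "anger", "speech_emotion": "sad",   "model_pred": "sad"},
]

def modality_reliance(items):
    """On incongruent items, count which channel each prediction follows;
    a high text rate indicates dominance of textual semantics."""
    text = sum(it["model_pred"] == it["text_emotion"] for it in items)
    speech = sum(it["model_pred"] == it["speech_emotion"] for it in items)
    return text / len(items), speech / len(items)

print(modality_reliance(items))   # (0.67, 0.33): text-dominant
```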

[8] LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

Adam S. Jovine, Tinghan Ye, Francis Bahk, Jingjing Wang, David B. Shmoys, Peter I. Frazier

Main category: cs.CL

TL;DR: LISTEN is a framework that uses LLMs as zero-shot preference oracles to help experts select optimal options from large sets with multiple competing objectives, using natural language guidance instead of formal preference models.

Motivation: Human experts struggle to formalize complex implicit preferences when selecting from large option sets with multiple objectives, creating a decision-making bottleneck.

Method: Two iterative algorithms: LISTEN-U (refines parametric utility function using LLM) and LISTEN-T (non-parametric tournament-style selection over small batches). Both use LLMs as zero-shot preference oracles guided by natural language priorities.

Result: Evaluated on flight booking, shopping, and exam scheduling tasks. LISTEN-U performs best when preferences align parametrically, while LISTEN-T offers more robust general performance.

Conclusion: LISTEN provides a promising approach for steering complex multi-objective decisions using natural language, reducing cognitive burden of traditional preference elicitation methods.

Abstract: Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert’s high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
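A minimal LISTEN-T-style tournament sketch with the LLM preference oracle stubbed out; the real system would issue a natural-language comparison prompt per pairing, so the toy scorer below is an assumption:

```python
import random

def llm_prefers(a, b, priorities: str) -> bool:
    """Stub for the zero-shot LLM preference oracle: True if `a` is preferred
    to `b` under the natural-language `priorities`. A toy scorer stands in
    for the model call."""
    return sum(a) >= sum(b)

def listen_t(options, priorities, batch=4, rounds=3):
    """Tournament-style selection over small batches, keeping each LLM call
    within context-window and cost limits."""
    pool = list(options)
    for _ in range(rounds):
        random.shuffle(pool)
        winners = []
        for i in range(0, len(pool), batch):
            group = pool[i:i + batch]
            champ = group[0]
            for cand in group[1:]:
                if llm_prefers(cand, champ, priorities):
                    champ = cand
            winners.append(champ)
        pool = winners
        if len(pool) == 1:
            break
    return pool[0]

options = [(random.random(), random.random()) for _ in range(32)]
print(listen_t(options, "prefer cheap, short itineraries"))
```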

[9] Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis

Hao Liu, Lijun He, Jiaxi Liang, Zhihan Ren, Haixia Bi, Fan Li

Main category: cs.CL

TL;DR: DASCO is a dependency structure augmented scoping framework for multimodal aspect-based sentiment analysis that addresses three core challenges: sentiment cue perception, multimodal misalignment, and semantic noise elimination through dependency parsing and multi-task pretraining.

Motivation: Existing MABSA approaches fail to simultaneously address three key challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE).

Method: Proposes DASCO framework with multi-task pretraining combining aspect-oriented enhancement, image-text matching, and sentiment-sensitive cognition. Uses dependency trees as syntactic branch combined with semantic branch to guide attention to critical contextual elements while filtering noise.

Result: Achieves state-of-the-art performance on two benchmark datasets across three subtasks, with notable gains in JMASA (+2.3% F1 and +3.5% precision on Twitter2015).

Conclusion: DASCO effectively addresses the three core challenges in MABSA through dependency structure augmentation and scope-oriented framework, demonstrating superior performance over existing approaches.

Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (\textbf{D}ependency Structure \textbf{A}ugmented \textbf{Sco}ping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we designed a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improved the model’s perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges like SCP and MIM. Furthermore, we incorporate dependency trees as syntactic branch combining with semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise for addressing SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+2.3% F1 and +3.5% precision on Twitter2015). The source code is available at https://github.com/LHaoooo/DASCO .
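A rough illustration of the syntactic-scope idea using spaCy dependency parses; spaCy is our choice of toolchain, not the paper's, and in DASCO the target-specific scope is learned jointly with a semantic branch rather than read off the tree directly:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def aspect_scope(sentence: str, aspect: str):
    """Return the dependency subtree around the aspect term: a crude proxy
    for a target-specific scope that filters out unrelated context."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.text.lower() == aspect.lower():
            # Walk one step up to the syntactic head, then collect its
            # subtree as the aspect's local scope.
            return [t.text for t in tok.head.subtree]
    return []

print(aspect_scope("The battery life is great but the camera is blurry.",
                   "battery"))
```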

[10] Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

Haoran Deng, Yingyu Lin, Zhenghao Lin, Xiao Liu, Yizhou Sun, Yi-An Ma, Yeyun Gong

Main category: cs.CL

TL;DR: LongFilter is a framework that selects training data with meaningful long-distance dependencies for efficient long-context pretraining by measuring information gain from extended context.

Motivation: Most available long-text data lacks meaningful long-distance dependencies, making training inefficient. Careful data selection is crucial for effective long-context language model training.

Method: LongFilter measures information gain by contrasting model predictions under long-context vs short-context settings to identify samples where long-range dependencies are essential.

Result: Experiments with LLaMA-3-8B extending context from 8K to 64K tokens show LongFilter efficiently selects high-quality data and yields substantial improvements on HELMET, LongBench, and RULER benchmarks.

Conclusion: LongFilter provides an effective framework for curating training data tailored to long-context pretraining, enabling more efficient and effective training of long-context language models.

Abstract: Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
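A sketch of the core measurement, assuming information gain is the drop in the target span's negative log-likelihood when the context is extended; gpt2 stands in for the much larger models used in the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"   # any small causal LM works for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def nll(context_ids, target_ids):
    """Mean negative log-likelihood of the target tokens given a context."""
    ids = torch.cat([context_ids, target_ids], dim=-1).unsqueeze(0)
    logits = model(ids).logits[0, :-1]
    logp = torch.log_softmax(logits, dim=-1)
    tgt = ids[0, 1:]
    token_nll = -logp[torch.arange(len(tgt)), tgt]
    return token_nll[-target_ids.numel():].mean().item()

text = " ".join(["Long documents often repeat local patterns."] * 150)
ids = tok(text, return_tensors="pt").input_ids[0][:1024]
target = ids[-128:]           # span to predict
short_ctx = ids[-256:-128]    # local context only
long_ctx = ids[:-128]         # full extended context

# Positive gain marks samples whose spans genuinely need long-range context.
gain = nll(short_ctx, target) - nll(long_ctx, target)
print(f"information gain from extended context: {gain:.3f} nats/token")
```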

[11] Ideology-Based LLMs for Content Moderation

Stefano Civelli, Pietro Bernardelle, Nardiena A. Pratama, Gianluca Demartini

Main category: cs.CL

TL;DR: Persona adoption in LLMs introduces subtle ideological biases in harmful content classification, affecting fairness and neutrality despite similar overall accuracy.

Motivation: To examine how persona adoption influences consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities.

Method: Analyzed persona adoption effects on harmful content classification across various LLM architectures and sizes, with additional study on politically targeted tasks to examine ideological alignment.

Result: Personas with different ideological leanings show distinct propensities to label content as harmful, with models aligning more closely with personas from same political ideology, strengthening within-ideology consistency while widening divergence across groups.

Conclusion: Persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about AI systems potentially reinforcing partisan perspectives under the guise of neutrality.

Abstract: Large language models (LLMs) are increasingly used in content moderation systems, where ensuring fairness and neutrality is essential. In this study, we examine how persona adoption influences the consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities (language vs. vision). At first glance, headline performance metrics suggest that personas have little impact on overall classification accuracy. However, a closer analysis reveals important behavioral shifts. Personas with different ideological leanings display distinct propensities to label content as harmful, showing that the lens through which a model “views” input can subtly shape its judgments. Further agreement analyses highlight that models, particularly larger ones, tend to align more closely with personas from the same political ideology, strengthening within-ideology consistency while widening divergence across ideological groups. To show this effect more directly, we conducted an additional study on a politically targeted task, which confirmed that personas not only behave more coherently within their own ideology but also exhibit a tendency to defend their perspective while downplaying harmfulness in opposing views. Together, these findings highlight how persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about the use of AI systems that may reinforce partisan perspectives under the guise of neutrality.

[12] The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration

Kotaro Furuya, Yuichi Kitagawa

Main category: cs.CL

TL;DR: Proposes an interaction-centric framework for automatic LLM team composition using semantic coherence from pairwise conversations to form synergistic teams without prior knowledge of model internals.

Motivation: Multi-agent LLM approaches show promise but require optimal team composition, which is challenging due to model opacity and lack of internal characteristic knowledge.

Method: Constructs a “language model graph” from the semantic coherence of pairwise conversations, then applies community detection to identify synergistic model clusters without requiring knowledge of model architecture or training data.

Result: Method discovers functionally coherent groups reflecting latent specializations. Synergistic teams outperform random baselines and achieve comparable accuracy to manually-curated teams.

Conclusion: Provides a new basis for automated design of collaborative multi-agent LLM teams through interaction-centric composition.

Abstract: While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a “language model graph” that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
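A compact sketch of the graph construction and community detection with networkx; the coherence scorer is stubbed with a toy table where, in the real method, pairwise conversations between the models would be scored:

```python
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

models = ["m1", "m2", "m3", "m4", "m5"]

def semantic_coherence(a: str, b: str) -> float:
    """Stub: run a short conversation between models a and b and score how
    semantically coherent the exchange stays. A fixed toy table stands in
    for real dialogues."""
    toy = {frozenset({"m1", "m2"}): 0.90, frozenset({"m1", "m3"}): 0.80,
           frozenset({"m2", "m3"}): 0.85, frozenset({"m4", "m5"}): 0.90}
    return toy.get(frozenset({a, b}), 0.2)

# Build the weighted "language model graph" over all model pairs.
G = nx.Graph()
for a, b in itertools.combinations(models, 2):
    G.add_edge(a, b, weight=semantic_coherence(a, b))

# Community detection surfaces candidate synergistic teams.
teams = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in teams])   # e.g., [['m1', 'm2', 'm3'], ['m4', 'm5']]
```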

[13] Beyond Long Context: When Semantics Matter More than Tokens

Tarun Kumar Chawdhury, Jon D. Duke

Main category: cs.CL

TL;DR: CLEAR method uses entity-aware retrieval for clinical question answering, achieving better performance with significantly fewer tokens than traditional methods.

Motivation: EHR clinical documentation is stored as base64-encoded attachments, making semantic question answering difficult. Traditional vector database methods miss nuanced clinical relationships.

Method: Clinical Entity Augmented Retrieval (CLEAR), an entity-aware retrieval method validated against zero-shot large-context inference and traditional chunk-based retrieval-augmented generation.

Result: CLEAR achieved F1 score of 0.90 vs 0.86 for embedding-based retrieval, 58.3% win rate, average semantic similarity of 0.878, and used 78% fewer tokens than wide context processing. Largest gains on long notes (75% win rate for 65k+ token documents).

Conclusion: Entity-aware retrieval improves both efficiency and accuracy in clinical NLP. The evaluation framework provides a reusable benchmark for clinical QA systems where semantic precision and computational efficiency are critical.

Abstract: Electronic Health Records (EHR) store clinical documentation as base64-encoded attachments in FHIR DocumentReference resources, which makes semantic question answering difficult. Traditional vector database methods often miss nuanced clinical relationships. The Clinical Entity Augmented Retrieval (CLEAR) method, introduced by Lopez et al. 2025, uses entity-aware retrieval and achieved improved performance with an F1 score of 0.90 versus 0.86 for embedding-based retrieval, while using over 70 percent fewer tokens. We developed a Clinical Notes QA Evaluation Platform to validate CLEAR against zero-shot large-context inference and traditional chunk-based retrieval-augmented generation. The platform was tested on 12 clinical notes ranging from 10,000 to 65,000 tokens representing realistic EHR content. CLEAR achieved a 58.3 percent win rate, an average semantic similarity of 0.878, and used 78 percent fewer tokens than wide-context processing. The largest performance gains occurred on long notes, with a 75 percent win rate for documents exceeding 65,000 tokens. These findings confirm that entity-aware retrieval improves both efficiency and accuracy in clinical natural language processing. The evaluation framework provides a reusable and transparent benchmark for assessing clinical question answering systems where semantic precision and computational efficiency are critical.
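A toy sketch of entity-aware retrieval by entity overlap; the entity extractor here is a crude stand-in for the clinical NER model a real CLEAR-style system would use:

```python
def extract_entities(text: str) -> set[str]:
    """Stub clinical entity extractor; a real system would use an NER model.
    Here: lowercase content words longer than three characters."""
    return {w.strip(".,?;:").lower() for w in text.split() if len(w) > 3}

def clear_retrieve(question: str, note_chunks: list[str], k: int = 2):
    """Rank chunks by overlap between question entities and chunk entities,
    rather than by dense embedding similarity alone."""
    q_ents = extract_entities(question)
    scored = [(len(q_ents & extract_entities(c)), c) for c in note_chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for s, c in scored[:k] if s > 0]

chunks = [
    "Patient started metformin 500 mg for type 2 diabetes.",
    "Colonoscopy unremarkable; follow-up in ten years.",
    "Hemoglobin A1c improved to 6.8% on current regimen.",
]
print(clear_retrieve("What diabetes medications is the patient taking?", chunks))
```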

[14] A Survey on Efficient Large Language Model Training: From Data-centric Perspectives

Junyu Luo, Bohan Wu, Xiao Luo, Zhiping Xiao, Yiqiao Jin, Rong-Cheng Tu, Nan Yin, Yifan Wang, Jingyang Yuan, Wei Ju, Ming Zhang

Main category: cs.CL

TL;DR: This paper presents a systematic survey of data-efficient post-training methods for Large Language Models (LLMs) from a data-centric perspective, proposing a taxonomy that covers data selection, quality enhancement, synthetic data generation, data distillation, and self-evolving data ecosystems.

Motivation: Current LLM post-training faces significant data challenges including high annotation costs and diminishing returns on data scale, making data-efficient post-training a crucial research question.

Method: The authors conduct a systematic survey and propose a taxonomy of data-efficient LLM post-training methods, categorizing approaches into data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems.

Result: The paper summarizes representative approaches in each category of the proposed taxonomy and outlines future research directions for data-efficient LLM post-training.

Conclusion: By examining challenges in data-efficient LLM post-training, the authors highlight open problems and propose potential research avenues, aiming to inspire further exploration into maximizing data utilization in large-scale model training.

Abstract: Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM

[15] Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation

Frederico Belcavello, Ely Matos, Arthur Lorenzi, Lisandra Bonoto, Lívia Ruiz, Luiz Fernando Pereira, Victor Herbst, Yulla Navarro, Helen de Andrade Abreu, Lívia Dutra, Tiago Timponi Torrent

Main category: cs.CL

TL;DR: Evaluation of LLM-based semantic role labeling for FrameNet annotation shows semi-automatic approach increases frame diversity while maintaining coverage, outperforming fully automatic methods.

Motivation: To assess the performance and impact of LLM-based tools in creating annotated datasets, particularly for semantic annotation tasks, as comprehensive evaluation under perspectivized NLP approaches is lacking.

Method: Compared annotation time, coverage and diversity across three settings: manual, automatic, and semi-automatic annotation using an LLM-based semantic role labeler for FrameNet-like semantic annotation.

Result: Semi-automatic annotation achieved increased frame diversity and similar coverage compared to manual annotation, while automatic annotation performed worse in all metrics except annotation time.

Conclusion: Hybrid semi-automatic annotation using LLMs provides the best balance, enhancing frame diversity while maintaining coverage, making it superior to fully automatic approaches for semantic annotation tasks.

Abstract: The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and dataset is a reality. Nonetheless, despite the potential of such tools for linguistic research, comprehensive evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to reduction of this gap by reporting on an extensive evaluation of the (semi-)automatization of FrameNet-like semantic annotation by the use of an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage and diversity in three experimental settings: manual, automatic and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage, when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics, except for annotation time.

[16] RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline

André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, Zhuo Li

Main category: cs.CL

TL;DR: RECAP is an agentic pipeline that extracts memorized training data from LLMs using a feedback loop with correction hints and jailbreaking to overcome refusals.

Motivation: To verify what training data LLMs have seen when direct inspection is impossible, by eliciting and verifying memorized content through model outputs.

Method: Uses a feedback-driven loop where initial extractions are evaluated by a secondary model, discrepancies are identified and translated into correction hints, then fed back to the target model. Includes jailbreaking module to overcome alignment refusals.

Result: On the EchoTrace benchmark (30+ books), RECAP achieved a nearly 24% improvement in ROUGE-L score (0.38 to 0.47) for copyrighted-text extraction with GPT-4.1, compared to single-iteration approaches.

Conclusion: RECAP effectively extracts memorized training data from LLMs through iterative refinement and jailbreaking, providing compelling evidence of what models have seen during training.

Abstract: If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text extraction improved from 0.38 to 0.47 - a nearly 24% increase.
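A schematic of the feedback loop, with difflib's ratio() standing in for ROUGE-L and the two model calls stubbed; all names and prompts below are illustrative:

```python
from difflib import SequenceMatcher

def make_hint(output: str, reference: str) -> str:
    """Stand-in for the secondary evaluator model: find where the attempt
    stops matching the reference and phrase that as a correction hint."""
    m = SequenceMatcher(None, output, reference)
    match = m.find_longest_match(0, len(output), 0, len(reference))
    anchor = reference[: match.b + match.size][-60:]
    return f"The passage should continue directly after: {anchor!r}"

def recap_loop(generate, seed_prompt: str, reference: str,
               max_iters: int = 5, target: float = 0.9) -> str:
    """Feedback-driven extraction: generate, score against the reference,
    and feed a minimal correction hint back to the target model."""
    prompt, best, best_sim = seed_prompt, "", 0.0
    for _ in range(max_iters):
        out = generate(prompt)
        sim = SequenceMatcher(None, out, reference).ratio()
        if sim > best_sim:
            best, best_sim = out, sim
        if sim >= target:
            break
        prompt = seed_prompt + "\nHint: " + make_hint(out, reference)
    return best

# Toy "model" that can only echo the hint, to exercise the loop's shape:
reference = "It was the best of times, it was the worst of times."
print(recap_loop(lambda p: p.split("Hint: ")[-1],
                 "Reproduce the opening line.", reference))
```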

[17] Revisiting Multilingual Data Mixtures in Language Model Pretraining

Negar Foroutan, Paul Teiletche, Ayush Kumar Tarun, Antoine Bosselut

Main category: cs.CL

TL;DR: Multilingual LLM training with 25-400 languages shows no performance degradation when properly balanced, challenges the ‘curse of multilinguality’, and reveals English as an effective pivot language across language families.

Motivation: To investigate concerns about trade-offs between language coverage and model performance in multilingual LLM pretraining, specifically testing the 'curse of multilinguality' assumption.

Method: Trained 1.1B and 3B parameter LLMs on diverse multilingual corpora varying from 25 to 400 languages, analyzing different data mixtures and pivot language strategies.

Result: No significant performance degradation when combining English and multilingual data; English serves as effective pivot language across families; no observed curse of multilinguality at this scale.

Conclusion: Multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings.

Abstract: The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant “curse of multilinguality” as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings

[18] Semantic Label Drift in Cross-Cultural Translation

Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Polydoros Giannouris, Sophia Ananiadou

Main category: cs.CL

TL;DR: Machine Translation induces label drift due to cultural divergence, especially in culturally sensitive domains, with LLMs amplifying this effect through encoded cultural knowledge.

Motivation: To investigate how cultural alignment between source and target languages affects semantic label preservation in Machine Translation, particularly in low-resource language scenarios.

Method: Conducted experiments across culturally sensitive and neutral domains using various MT systems including modern LLMs, analyzing label drift during translation.

Result: Found that MT systems induce label drift, LLMs encode cultural knowledge that amplifies drift, and cultural similarity between languages is crucial for label preservation.

Conclusion: Neglecting cultural factors in MT undermines label fidelity and risks misinterpretation and cultural conflict in downstream applications.

Abstract: Machine Translation (MT) is widely employed to address resource scarcity in low-resource languages by generating synthetic data from high-resource counterparts. While sentiment preservation in translation has long been studied, a critical but underexplored factor is the role of cultural alignment between source and target languages. In this paper, we hypothesize that semantic labels are drifted or altered during MT due to cultural divergence. Through a series of experiments across culturally sensitive and neutral domains, we establish three key findings: (1) MT systems, including modern Large Language Models (LLMs), induce label drift during translation, particularly in culturally sensitive domains; (2) unlike earlier statistical MT tools, LLMs encode cultural knowledge, and leveraging this knowledge can amplify label drift; and (3) cultural similarity or dissimilarity between source and target languages is a crucial determinant of label preservation. Our findings highlight that neglecting cultural factors in MT not only undermines label fidelity but also risks misinterpretation and cultural conflict in downstream applications.
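Label drift itself reduces to a simple measurement once both sides are classified; a minimal sketch, assuming the same classifier is applied to source texts and their machine translations:

```python
def label_drift_rate(src_labels, tgt_labels) -> float:
    """Fraction of examples whose label changes after translation: the
    semantic label drift the paper measures."""
    assert len(src_labels) == len(tgt_labels)
    changed = sum(s != t for s, t in zip(src_labels, tgt_labels))
    return changed / len(src_labels)

# Classify each source text and its translation with the same classifier,
# then compare (toy labels below):
src = ["positive", "negative", "positive", "neutral"]
tgt = ["positive", "neutral",  "negative", "neutral"]
print(f"label drift: {label_drift_rate(src, tgt):.0%}")  # 50%
```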

[19] SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation

Sina Bagheri Nezhad, Yao Li, Ameeta Agrawal

Main category: cs.CL

TL;DR: SymCode is a neurosymbolic framework that improves mathematical reasoning in LLMs by generating verifiable SymPy code instead of prose, achieving up to 13.6% accuracy improvements on challenging benchmarks.

Motivation: LLMs struggle with complex mathematical reasoning where prose-based generation leads to unverified and arithmetically unsound solutions, lacking deterministic verification mechanisms.

Method: Reframes mathematical problem-solving as verifiable code generation using the SymPy library, grounding LLM reasoning in a deterministic symbolic engine.

Result: Achieved significant accuracy improvements of up to 13.6 percentage points over baselines on MATH-500 and OlympiadBench benchmarks, with better token efficiency and transparent error types.

Conclusion: SymCode represents a key step towards more accurate and trustworthy AI in formal domains by shifting model failures from opaque logical fallacies to transparent programmatic errors.

Abstract: Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
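The verifiable-code idea is easy to demonstrate with SymPy directly; this is a generic example of the solve-then-verify pattern, not code emitted by SymCode:

```python
import sympy as sp

# "Find all real x with x^2 - 5x + 6 = 0, and verify the solutions."
x = sp.symbols("x", real=True)
equation = sp.Eq(x**2 - 5 * x + 6, 0)

solutions = sp.solve(equation, x)   # symbolic and exact: [2, 3]

# Verification step: substitute back, so an error surfaces as a failed
# check rather than an unnoticed arithmetic slip in prose reasoning.
assert all(sp.simplify(equation.lhs.subs(x, s)) == 0 for s in solutions)
print(solutions)
```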

[20] NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium

Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li

Main category: cs.CL

TL;DR: The paper presents high-performance matrix multiplication optimizations for LLM inference on AWS Trainium AI accelerators, achieving significant speedups over existing implementations.

Motivation: AWS Trainium AI accelerators offer cost-effective solutions for LLM workloads but present challenges due to their systolic array architecture and data layout requirements, making high-performance optimization difficult.

Method: Developed techniques including kernel fusion and novel caching strategies to reduce data movement across software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose operations.

Result: Achieved average 1.35x speedup (up to 2.22x) at matmul kernel level and average 1.66x speedup (up to 2.49x) for end-to-end LLM inference across nine datasets and four recent LLMs.

Conclusion: The proposed optimizations successfully overcome Trainium’s architectural challenges and significantly outperform AWS’s state-of-the-art matmul implementation, demonstrating effective techniques for AI accelerator optimization.

Abstract: AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
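Real Trainium kernels are written against its systolic arrays and software-managed memories, so the following numpy loop is only a schematic of what kernel fusion buys: the intermediate of two chained matmuls is consumed tile by tile instead of being materialized and re-read in full:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 512))
W1 = rng.standard_normal((512, 512))
W2 = rng.standard_normal((512, 512))

# Unfused: materialize H in full, then read it back for the second matmul;
# on an accelerator this is extra traffic through the memory hierarchy.
H = X @ W1
Y_unfused = H @ W2

# "Fused" schedule: process X in row tiles so each intermediate tile stays
# in fast memory (a Python-level illustration of the scheduling idea only).
tile = 64
Y_fused = np.empty_like(Y_unfused)
for i in range(0, X.shape[0], tile):
    Y_fused[i:i + tile] = (X[i:i + tile] @ W1) @ W2

print(np.allclose(Y_unfused, Y_fused))  # True: same math, different schedule
```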

[21] AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache

Dinghong Song, Yuan Feng, Yiwei Wang, Shangye Chen, Cyril Guyot, Filip Blagojevic, Hyeran Jeon, Pengfei Su, Dong Li

Main category: cs.CL

TL;DR: AttnCache accelerates LLM prefill inference by caching and reusing similar attention maps, reducing computational overhead of self-attention with minimal accuracy loss.

Motivation: Many real-world workloads rely on prefill-only inference where self-attention becomes the performance bottleneck due to quadratic complexity with sequence length.

Method: Proposes AttnCache framework that uses attention map memorization database with efficient caching and similarity search to retrieve and reuse similar attention maps during inference.

Result: Achieves 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU with negligible accuracy degradation.

Conclusion: AttnCache effectively accelerates prefill stage of LLM inference by leveraging attention map similarity across different inputs.

Abstract: Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
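A toy version of the cache-and-reuse mechanism, keyed by cosine similarity of an input embedding; the threshold, key construction, and linear scan are assumptions for illustration (a real system would use an approximate-nearest-neighbor index):

```python
import numpy as np

class AttnCache:
    """Toy attention-map cache: key each entry by an embedding of the input
    and reuse the stored map when a new input is similar enough."""
    def __init__(self, threshold: float = 0.95):
        self.keys, self.maps, self.threshold = [], [], threshold

    def lookup(self, key: np.ndarray):
        for k, m in zip(self.keys, self.maps):
            cos = k @ key / (np.linalg.norm(k) * np.linalg.norm(key) + 1e-9)
            if cos >= self.threshold:
                return m          # cache hit: skip the quadratic attention
        return None

    def insert(self, key: np.ndarray, attn_map: np.ndarray):
        self.keys.append(key)
        self.maps.append(attn_map)

cache = AttnCache()
rng = np.random.default_rng(0)
k1 = rng.standard_normal(64)
cache.insert(k1, rng.standard_normal((16, 16)))
# A slightly perturbed key should still hit the cache.
print(cache.lookup(k1 + 0.01 * rng.standard_normal(64)) is not None)  # True
```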

[22] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee

Main category: cs.CL

TL;DR: SRL is a new training framework that combines supervised learning and reinforcement learning to improve multi-step reasoning in small LLMs by generating logical action sequences with internal reasoning monologues.

Motivation: LLMs struggle with multi-step reasoning tasks. SFT overfits to demonstrations through rigid imitation, while RLVR fails when correct solutions are rarely sampled. There's a need for a method that provides richer learning signals even with incorrect rollouts.

Method: SRL reformulates problem solving as generating sequences of logical actions. It trains models to generate internal reasoning monologues before each action, providing step-wise rewards based on similarity between model actions and expert actions from SFT dataset.

Result: SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Initializing with SRL before RLVR yields strongest performance. SRL also generalizes effectively to agentic software engineering tasks.

Conclusion: SRL is a robust and versatile training framework for reasoning-oriented LLMs that bridges the gap between SFT and RLVR, enabling better multi-step reasoning capabilities in small models.

Abstract: Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical “actions”. SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model’s actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
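The step-wise reward can be sketched in a few lines, with difflib similarity standing in for whatever action-matching metric the paper actually uses:

```python
from difflib import SequenceMatcher

def step_reward(model_action: str, expert_action: str) -> float:
    """Dense per-step reward: similarity between the model's action and the
    expert action extracted from the SFT trajectory, instead of a single
    sparse end-of-rollout verification signal."""
    return SequenceMatcher(None, model_action, expert_action).ratio()

expert_steps = ["factor the quadratic", "set each factor to zero", "solve x=2, x=3"]
model_steps  = ["factor the polynomial", "set factors to zero", "guess x=4"]

rewards = [step_reward(m, e) for m, e in zip(model_steps, expert_steps)]
print([round(r, 2) for r in rewards])   # partial credit even on a wrong rollout
```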

[23] PORTool: Tool-Use LLM Training with Rewarded Tree

Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rodin Luo, Jing Gao

Main category: cs.CL

TL;DR: PORTool is a reinforcement learning method that trains tool-use LLMs to explore multiple solution trajectories in dynamic tool environments, improving accuracy and efficiency.

Motivation: Current tool-use LLMs trained on static datasets fail to explore alternative solutions and perform poorly in dynamic tool-call environments, limiting their reasoning capabilities.

Method: Uses RL to generate multiple rollouts forming tree-like structures, assigns step-wise rewards based on answer correctness and tool-call success, and calculates fork-relative advantages blended with trajectory-relative advantages for training.

Result: Significant improvements in final accuracy and reduction in tool-call steps compared to other training approaches, validated through ablation studies with 17 tools covering time-sensitive and invariant topics.

Conclusion: PORTool effectively enhances tool-use LLMs’ exploration capabilities and performance in dynamic environments through structured RL training with step-wise reward mechanisms.

Abstract: Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
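A small numeric sketch of fork-relative versus trajectory-relative advantages on a toy rollout tree; the rewards, the fork grouping, and the blending weight are all assumed for illustration:

```python
import numpy as np

# Hypothetical per-step rewards for four rollouts of one query:
# rollouts 0 and 1 share their first tool-call step (one fork), as do 2 and 3.
step_rewards = {
    0: [1.0, 0.5, 1.0],
    1: [1.0, 0.2, 0.0],
    2: [0.3, 0.9, 1.0],
    3: [0.3, 0.1, 0.0],
}
forks = {0: [0, 1], 1: [2, 3]}   # rollouts grouped by shared first step

returns = {i: sum(r) for i, r in step_rewards.items()}
grand_mean = np.mean(list(returns.values()))
traj_adv = {i: returns[i] - grand_mean for i in returns}

# Fork-relative advantage: compare each rollout only against rollouts that
# took the same early tool calls, isolating credit for the divergent steps.
fork_adv = {}
for members in forks.values():
    baseline = np.mean([returns[i] for i in members])
    for i in members:
        fork_adv[i] = returns[i] - baseline

blend = 0.5   # mixing weight between the two signals (assumed)
advantage = {i: blend * fork_adv[i] + (1 - blend) * traj_adv[i] for i in returns}
print(advantage)
```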

[24] Rethinking Cross-lingual Alignment: Balancing Transfer and Cultural Erasure in Multilingual LLMs

HyoJung Han, Sweta Agrawal, Eleftheria Briakou

Main category: cs.CL

TL;DR: Cross-lingual alignment in LLMs causes cultural erasure while improving factual transfer. The paper introduces a framework to quantify this trade-off and proposes Surgical Steering, a layer-specific activation steering method that better balances factual transfer and cultural localization.

Motivation: Current cross-lingual alignment approaches inadvertently cause 'cultural erasure': the loss of culturally-situated responses that should differ based on query language, while only improving factual knowledge transfer.

Method: Introduced a transfer-localization plane evaluation framework, analyzed CLA approaches, and proposed Surgical Steering - an inference-time method that applies targeted activation steering to different model layers to disentangle factual transfer from culturally-specific knowledge.

Result: Found that CLA approaches consistently improve factual transfer at the direct cost of cultural localization across six languages. Identified that universal factual transfer and culturally-specific knowledge are optimally steerable at different model layers.

Conclusion: Surgical Steering effectively overcomes limitations of current alignment techniques by achieving better balance between factual transfer and cultural localization through layer-specific activation steering.

Abstract: Cross-lingual alignment (CLA) aims to align multilingual representations, enabling Large Language Models (LLMs) to seamlessly transfer knowledge across languages. While intuitive, we hypothesize, this pursuit of representational convergence can inadvertently cause “cultural erasure”, the functional loss of providing culturally-situated responses that should diverge based on the query language. In this work, we systematically analyze this trade-off by introducing a holistic evaluation framework, the transfer-localization plane, which quantifies both desirable knowledge transfer and undesirable cultural erasure. Using this framework, we re-evaluate recent CLA approaches and find that they consistently improve factual transfer at the direct cost of cultural localization across all six languages studied. Our investigation into the internal representations of these models reveals a key insight: universal factual transfer and culturally-specific knowledge are optimally steerable at different model layers. Based on this finding, we propose Surgical Steering, a novel inference-time method that disentangles these two objectives. By applying targeted activation steering to distinct layers, our approach achieves a better balance between the two competing dimensions, effectively overcoming the limitations of current alignment techniques.
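
The core mechanism, applying steering vectors at distinct layers at inference time, can be sketched with forward hooks. The model name, layer indices, steering directions, and scale below are placeholders; the paper derives its vectors and layer choices from its own analysis.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
tok = AutoTokenizer.from_pretrained("gpt2")

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Add a steering vector to the residual stream at one layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

d_model = model.config.hidden_size
factual_dir = torch.randn(d_model)   # stand-in "factual transfer" direction
cultural_dir = torch.randn(d_model)  # stand-in "cultural localization" direction

# Steer the two objectives at different depths, per the paper's finding that
# they are optimally steerable at distinct layers (indices and scale assumed).
h1 = model.transformer.h[4].register_forward_hook(make_steering_hook(factual_dir, 0.5))
h2 = model.transformer.h[9].register_forward_hook(make_steering_hook(cultural_dir, 0.5))

ids = tok("The festival most associated with this holiday is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))

h1.remove(); h2.remove()  # restore the unsteered model
```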

[25] Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings

Felipe Larios, Mariana Borras-Osorio, Yuqi Wu, Ana Gabriela Claros, David Toro-Tobon, Esteban Cabezas, Ricardo Loor-Torres, Maria Mateo Chavez, Kerly Guevara Maldonado, Luis Vilatuna Andrango, Maria Lizarazo Jimenez, Ivan Mateo Alzamora, Misk Al Zahidy, Marcelo Montero, Ana Cristina Proano, Cristian Soto Jacome, Jungwei W. Fan, Oscar J. Ponce-Ponte, Megan E. Branda, Naykky Singh Ospina, Juan P. Brito

Main category: cs.CL

TL;DR: Developed an NLP pipeline to identify incidental thyroid findings (ITFs) in radiology reports, finding a 7.8% prevalence and a strong association with thyroid cancer diagnosis cascades.

DetailsMotivation: Incidental thyroid findings are increasingly detected but their prevalence, features, and clinical consequences remain undefined.

Method: Retrospective cohort study using transformer-based NLP pipeline to identify ITFs from radiology reports across multiple imaging modalities.

Result: Among 115,683 patients, 7.8% had ITFs, of which 92.9% were nodules. ITFs were associated with higher odds of thyroid nodule diagnosis, biopsy, thyroidectomy, and thyroid cancer diagnosis. Most cancers were papillary and were larger when detected after an ITF.

Conclusion: ITFs are common and strongly associated with cascades leading to detection of small, low-risk cancers, highlighting role in thyroid cancer overdiagnosis and need for standardized reporting and selective follow-up.

Abstract: Importance Incidental thyroid findings (ITFs) are increasingly detected on imaging performed for non-thyroid indications. Their prevalence, features, and clinical consequences remain undefined. Objective To develop, validate, and deploy a natural language processing (NLP) pipeline to identify ITFs in radiology reports and assess their prevalence, features, and clinical outcomes. Design, Setting, and Participants Retrospective cohort of adults without prior thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline identified ITFs and extracted nodule characteristics from image reports from multiple modalities and body regions. Main Outcomes and Measures Prevalence of ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer diagnosis. Logistic regression identified demographic and imaging-related factors. Results Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9% women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more likely in women, older adults, those with higher BMI, and when imaging was ordered by oncology or internal medicine. Compared with chest CT, ITFs were more likely via neck CT, PET, and nuclear medicine scans. Nodule characteristics were poorly documented, with size reported in 44% and other features in fewer than 15% (e.g. calcifications). Compared with patients without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis, biopsy, thyroidectomy and thyroid cancer diagnosis. Most cancers were papillary, and larger when detected after ITFs vs no ITF. Conclusions ITFs were common and strongly associated with cascades leading to the detection of small, low-risk cancers. These findings underscore the role of ITFs in thyroid cancer overdiagnosis and the need for standardized reporting and more selective follow-up.

[26] QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, Hiroya Takamura

Main category: cs.CL

TL;DR: The paper introduces QCoder Benchmark, an evaluation framework for assessing LLMs on quantum programming tasks with hardware feedback, showing that reasoning-based models significantly outperform standard LLMs and human coders.

DetailsMotivation: There is a gap in evaluating LLMs for programming domains requiring hardware interaction, particularly quantum programming where code is executed on quantum computers.

Method: Created QCoder Benchmark with quantum simulator environment for domain-specific metrics (circuit depth, execution time, error classification) and incorporated human-written code from programming contests for comparison.

Result: Advanced models like GPT-4o achieved only 18.97% accuracy, while reasoning-based models like o3 reached up to 78% accuracy, outperforming the averaged success rate of human-written code (39.98%).

Conclusion: QCoder Benchmark provides a comprehensive evaluation framework for quantum programming tasks, demonstrating the superiority of reasoning-based models and the need for specialized benchmarks in hardware-interactive domains.

Abstract: Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written codes. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming averaged success rates of human-written codes (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.
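
The simulator-side feedback the benchmark describes includes structural circuit metrics. A minimal sketch of the kind of signal that can be reported alongside pass/fail, using Qiskit on a toy circuit (the circuit itself is illustrative, not a benchmark task):

```python
from qiskit import QuantumCircuit

# Toy "generated solution": prepare a Bell state and measure both qubits.
qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])

# Domain-specific feedback of the kind QCoder can surface to guide
# regeneration: structural metrics computed from the circuit itself.
print("depth:", qc.depth())          # longest gate dependency chain
print("gate counts:", qc.count_ops())
```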

[27] Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

Feng Ju, Zeyu Qin, Rui Min, Zhitao He, Lingpeng Kong, Yi R. Fung

Main category: cs.CL

TL;DR: The paper proposes 1PNS (one problem, multiple solutions) training paradigm to increase output diversity in LLMs, introduces Reasoning Path Divergence (RPD) metric to measure semantic differences in reasoning paths, and shows improved performance over traditional 1P1S training.

DetailsMotivation: Test-Time Scaling (TTS) improves LLM reasoning but suffers from low output diversity due to 'one problem, one solution' (1P1S) training that pushes models toward narrow reasoning paths.

Method: Propose 1PNS training paradigm with Reasoning Path Divergence (RPD) metric to measure semantic differences between multi-step chains of thought, curate diverse solution sets, and fine-tune Qwen3-4B-Base model.

Result: RPD-selected training yields more varied outputs and higher pass@k, with +2.80% average gain in pass@16 over 1P1S baseline and +4.99% gain on AIME24, demonstrating 1PNS amplifies TTS effectiveness.

Conclusion: 1PNS training paradigm with RPD metric successfully increases output diversity and improves reasoning performance, showing that exposing models to varied reasoning paths enhances Test-Time Scaling effectiveness.

Abstract: While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common “one problem, one solution” (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a “one problem, multiple solutions” (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
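
A minimal sketch of the curation idea: score pairwise divergence between step-aligned chains of thought, then greedily grow a maximally diverse set. The step embedding and the alignment/scoring below are stand-ins, not the paper's exact RPD metric.

```python
import numpy as np

def step_divergence(a_steps, b_steps, embed):
    """Stand-in for RPD: align each step to its most similar counterpart
    and average (1 - cosine similarity) over the alignment."""
    A = np.stack([embed(s) for s in a_steps])
    B = np.stack([embed(s) for s in b_steps])
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    sim = A @ B.T
    return 1.0 - sim.max(axis=1).mean()

def select_diverse(solutions, k, embed):
    """Greedy max-min curation: add the solution farthest from the set."""
    chosen = [0]
    while len(chosen) < k:
        best, best_score = None, -1.0
        for i in range(len(solutions)):
            if i in chosen:
                continue
            score = min(step_divergence(solutions[i], solutions[j], embed)
                        for j in chosen)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

# Toy per-step embedding (random but stable per step) to keep this runnable.
rng = np.random.default_rng(0)
vocab = {}
def toy_embed(step):
    return vocab.setdefault(step, rng.normal(size=16))

sols = [["factor", "solve"], ["graph", "intersect"], ["factor", "check"]]
print(select_diverse(sols, 2, toy_embed))
```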

[28] On the Influence of Discourse Relations in Persuasive Texts

Nawar Turk, Sevag Kaspar, Leila Kosseim

Main category: cs.CL

TL;DR: This paper analyzes the relationship between persuasion techniques and discourse relations using LLMs, creating silver datasets that reveal six key discourse relations crucial for persuasive communication.

DetailsMotivation: To understand how persuasion techniques relate to discourse structures, especially since no existing dataset contains both annotations, with applications in detecting online propaganda and misinformation.

Method: Used SemEval 2023 Task 3 dataset with 19 PTs, developed LLM-based classifiers with 4 LLMs and 10 prompts to label 22 PDTB 3.0 discourse relations, created ensemble models with majority-pooling for silver datasets.

Result: Created 5 silver datasets (ranging from 204 to 1,281 instances, depending on the majority-pooling strategy) showing that six discourse relations (Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) are crucial for persuasion techniques such as Loaded Language, Exaggeration/Minimisation, Repetition, and casting Doubt.

Conclusion: The identified discourse relations play vital roles in persuasive texts, providing insights for detecting propaganda and understanding effective communication patterns.

Abstract: This paper investigates the relationship between Persuasion Techniques (PTs) and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and prompt engineering. Since no dataset annotated with both PTs and DRs exists, we took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point and developed LLM-based classifiers to label each instance of the dataset with one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10 different prompts, resulting in 40 unique DR classifiers. Ensemble models using different majority-pooling strategies were used to create 5 silver datasets of instances labelled with both persuasion techniques and level-2 PDTB senses. The silver dataset sizes vary from 1,281 instances to 204 instances, depending on the majority pooling technique used. Statistical analysis of these silver datasets shows that six discourse relations (namely Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) play a crucial role in persuasive texts, especially in the use of Loaded Language, Exaggeration/Minimisation, Repetition and to cast Doubt. This insight can contribute to detecting online propaganda and misinformation, as well as to our general understanding of effective communication.
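
The majority-pooling step that produces silver labels of varying strictness can be sketched directly; the agreement thresholds below are illustrative, but they show why stricter pooling yields smaller silver datasets.

```python
from collections import Counter

def majority_pool(predictions, min_votes):
    """Keep an instance only if enough classifiers agree on one label.

    predictions: level-2 PDTB sense labels, one per classifier run
    (here, 4 LLMs x 10 prompts = 40 predictions per instance).
    """
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes >= min_votes else None  # None = excluded from silver set

preds = ["Cause"] * 23 + ["Contrast"] * 10 + ["Condition"] * 7
for threshold in (15, 21, 30):  # stricter pooling -> smaller, cleaner dataset
    print(threshold, majority_pool(preds, threshold))
```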

[29] MossNet: Mixture of State-Space Experts is a Multi-Head Attention

Shikhar Tuli, James Seale Smith, Haris Jeelani, Chi-Heng Lin, Abhishek Patel, Vasili Ramanishka, Yen-Chang Hsu, Hongxia Jin

Main category: cs.CL

TL;DR: MossNet is a novel mixture-of-state-space-experts architecture that emulates multi-head attention using MoE in both MLP blocks and SSM kernels, outperforming transformers and SSMs in efficiency and performance.

DetailsMotivation: Current SSM/GRM-based methods often emulate only a single attention head, limiting expressiveness. There is a need for more expressive recurrent architectures that can compete with transformers while maintaining efficiency.

Method: Proposes MossNet with mixture-of-experts implementation in both channel-mixing MLP blocks and time-mixing SSM kernels to realize multiple attention heads, creating a linear multi-head attention equivalent.

Result: Outperforms both transformer- and SSM-based architectures of similar size and data budgets. Larger variants trained on trillions of tokens confirm scalability. Shows favorable runtime speed and resource usage on mobile devices and GPUs.

Conclusion: MossNet represents a compelling new direction for efficient, high-performing recurrent LLM architectures that combine the benefits of SSMs with multi-head attention expressiveness.

Abstract: Large language models (LLMs) have significantly advanced generative applications in natural language processing (NLP). Recent trends in model architectures revolve around efficient variants of transformers or state-space/gated-recurrent models (SSMs, GRMs). However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, we propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in channel-mixing multi-layered perceptron (MLP) blocks but also in the time-mixing SSM kernels to realize multiple “attention heads.” Extensive experiments on language modeling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and superior performance. In addition, real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU demonstrate favorable runtime speed and resource usage compared to similarly sized baselines. Our results suggest that MossNet is a compelling new direction for efficient, high-performing recurrent LLM architectures.
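
The routing pattern at the heart of the design can be shown with a generic gated mixture. In MossNet the experts would be SSM kernels (time-mixing) or MLPs (channel-mixing); the stand-in linear experts below show only the gating mechanics, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedMixture(nn.Module):
    """Generic mixture-of-experts gating sketch: a learned gate mixes the
    outputs of parallel experts per token, analogous to multiple 'heads'."""
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model)
                                     for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                                # x: (batch, seq, d)
        weights = torch.softmax(self.gate(x), dim=-1)    # per-token routing
        outs = torch.stack([e(x) for e in self.experts], dim=-1)
        return (outs * weights.unsqueeze(-2)).sum(-1)    # gated combination

x = torch.randn(2, 8, 64)
print(GatedMixture()(x).shape)  # torch.Size([2, 8, 64])
```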

[30] Similarity-Distance-Magnitude Language Models

Allen Schmaltz

Main category: cs.CL

TL;DR: SDM language models are fine-tuned to maximize generations in well-calibrated high-probability regions using a final-layer SDM activation layer for binary classification of instruction-following.

DetailsMotivation: To improve statistical efficiency by reducing abstentions in language models while maintaining well-calibrated, high-probability generations.

Method: Convert existing pre-trained decoder-only Transformer LMs into SDM LMs via supervised fine-tuning using a final-layer SDM activation layer, contrastive input encoding, and online hard negative generation.

Result: SDM LMs achieve reduced abstentions compared to strong supervised baselines while maintaining instruction-following capability.

Conclusion: SDM fine-tuning effectively improves statistical efficiency in language models by optimizing for well-calibrated high-probability generations through contrastive training and SDM activation layers.

Abstract: We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which are sequence prediction models fine-tuned to maximize the proportion of generations in the well-calibrated, high-probability region partitioned by a final-layer SDM activation layer used for binary classification of instruction-following. We demonstrate that existing pre-trained decoder-only Transformer LMs can be readily converted into SDM LMs via supervised fine-tuning, using the final-layer SDM activation layer during training to estimate a change-of-base for a supervised next-token loss over a contrastive input encoding scheme, with additional hard negative examples generated online during training. This results in reduced abstentions (i.e., improved statistical efficiency) compared to strong supervised baselines.

[31] RCScore: Quantifying Response Consistency in Large Language Models

Dongjun Jang, Youngchae Ahn, Hyopil Shin

Main category: cs.CL

TL;DR: RCScore is a framework that quantifies how instruction style affects LLM performance, revealing accuracy shifts of up to 16.7 percentage points across different instruction formulations.

DetailsMotivation: Current LLM evaluations use single instruction templates, ignoring models' sensitivity to instruction style, which is crucial for real-world deployment reliability.

Method: Systematically transform benchmark problems into multiple instruction styles and introduce Cross-Response Similarity (CRS) to measure stylistic self-consistency.

Result: Experiments across 10 LLMs on 4 reasoning benchmarks show instruction style can shift accuracy by up to 16.7 percentage points. CRS strongly correlates with task accuracy, and deterministic decoding produces more stylistically stable outputs.

Conclusion: RCScore provides a principled approach to assess instruction robustness, with consistency serving as a valuable proxy for model reliability.

Abstract: Current LLM evaluations often rely on a single instruction template, overlooking models’ sensitivity to instruction style-a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7% points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.
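
The CRS idea, averaging pairwise similarity between a model's answers to restyled versions of the same problem, can be sketched with a stand-in similarity measure. TF-IDF cosine below substitutes for whatever response representation the paper uses.

```python
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cross_response_similarity(responses):
    """Average pairwise similarity of answers to restyled versions of one
    problem; higher = more stylistically self-consistent."""
    X = TfidfVectorizer().fit_transform(responses)
    sims = [cosine_similarity(X[i], X[j])[0, 0]
            for i, j in combinations(range(len(responses)), 2)]
    return sum(sims) / len(sims)

# Same problem posed as an imperative, a question, and a terse instruction.
responses = [
    "The answer is 42 because 6 times 7 equals 42.",
    "6 x 7 = 42, so the result is 42.",
    "I think the result might be 36.",
]
print(round(cross_response_similarity(responses), 3))
```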

[32] Don’t Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation

Woojin Kim, Jaeyoung Do

Main category: cs.CL

TL;DR: Token Timestep Allocation (TTA) addresses update forgetting in diffusion language models by implementing per-token timestep schedules, improving controllability and fluency without training.

DetailsMotivation: Diffusion language models suffer from update forgetting where uniform updates erase earlier semantic edits, degrading fluency and coherence during text generation.

Method: Proposes TTA with per-token timestep schedules that freeze critical tokens early while continuing to refine uncertain tokens, implemented as fixed or adaptive policies at inference time.

Result: Improves sentiment control accuracy by more than 20% and nearly halves perplexity using fewer than one-fifth of the refinement steps; in detoxification, lowers maximum toxicity (12.2 vs 14.5) and perplexity (26.0 vs 32.0).

Conclusion: Softened ordering via timestep allocation is crucial for mitigating update forgetting and achieving stable, controllable diffusion text generation across various DLMs.

Abstract: While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context agnostic updates induce token level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity using less than one fifth the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
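
A minimal sketch of a fixed TTA-style policy: tokens above a confidence threshold stop receiving updates after an early fraction of the schedule, so their edits cannot be erased, while uncertain tokens keep the full refinement budget. The threshold and freeze fraction are assumed values.

```python
import numpy as np

def allocate_timesteps(confidences, total_steps, freeze_frac=0.3):
    """Fixed per-token timestep allocation: confident tokens freeze early;
    uncertain tokens keep refining (threshold and fraction assumed)."""
    budgets = np.full(len(confidences), total_steps)
    confident = np.array(confidences) > 0.9
    budgets[confident] = int(total_steps * freeze_frac)
    return budgets

def refine(tokens, confidences, total_steps=5):
    budgets = allocate_timesteps(confidences, total_steps)
    for t in range(total_steps):
        active = [tok for tok, b in zip(tokens, budgets) if t < b]
        print(f"step {t}: refining {active}")  # frozen tokens keep earlier edits

refine(tokens=["The", "movie", "was", "???"],
       confidences=[0.99, 0.95, 0.97, 0.40])
```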

[33] What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson

Main category: cs.CL

TL;DR: WIMHF is a method that uses sparse autoencoders to automatically extract human-interpretable features from human feedback data, revealing what preferences datasets measure and what annotators actually express, enabling better data curation and personalization.

DetailsMotivation: Human feedback can unpredictably alter language models, and practitioners lack understanding of what feedback data encodes. Current methods require pre-specifying attributes, making automatic feature extraction challenging.

Method: WIMHF uses sparse autoencoders to explain feedback data, characterizing both measurable preferences and actual annotator expressions across 7 datasets.

Result: Identified a small number of interpretable features that account for the majority of the preference-prediction signal, revealing diverse human preferences and dataset-level context effects. The features enabled effective data curation (+37% safety gain) and fine-grained personalization via annotator-specific weights.

Conclusion: WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data, addressing the challenge of unpredictable model alterations from human feedback.

Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What’s In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.
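
The workhorse here is a sparse autoencoder over response representations. A minimal sketch follows; the dimensions, sparsity penalty, and training data are assumptions, and WIMHF's full pipeline (feature interpretation, preference prediction) is not reproduced.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete dictionary trained with an L1 penalty
    so each input activates only a few, hopefully interpretable, features."""
    def __init__(self, d_in=768, d_hidden=4096):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(z), z

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
x = torch.randn(32, 768)  # stand-in for embeddings of preference-pair responses

for _ in range(10):
    recon, z = sae(x)
    loss = ((recon - x) ** 2).mean() + 1e-3 * z.abs().mean()  # recon + sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Features that fire differently for chosen vs. rejected responses become
# candidates for interpretable preference dimensions.
print("mean active features per input:", (z > 0).float().sum(1).mean().item())
```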

[34] Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning

Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, Xipeng Qiu

Main category: cs.CL

TL;DR: The paper introduces GlobalQA, the first benchmark for evaluating global RAG capabilities, and proposes GlobalRAG framework that significantly outperforms existing methods on global information aggregation tasks.

DetailsMotivation: Current RAG evaluation focuses on local retrieval (finding relevant chunks from small document subsets), but real-world applications require global RAG capabilities - aggregating information across entire document collections to derive corpus-level insights.

Method: Proposed GlobalRAG framework with: 1) chunk-level retrieval preserving structural coherence, 2) LLM-driven intelligent filters to eliminate noisy documents, 3) aggregation modules for precise symbolic computation.

Result: Existing RAG methods perform poorly on global tasks (strongest baseline: 1.51 F1). GlobalRAG achieves 6.63 F1 on Qwen2.5-14B model, representing significant improvement over baselines.

Conclusion: GlobalRAG effectively addresses the limitations of current RAG methods for global information aggregation tasks, demonstrating substantial performance gains on the new GlobalQA benchmark.

Abstract: Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability – global RAG – which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, “What are the top 10 most cited papers in 2023?”). In this paper, we introduce GlobalQA – the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline’s 1.51 F1, validating the effectiveness of our method.
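
The aggregation module's role, performing exact symbolic computation over filtered extractions rather than leaving arithmetic to the generator, can be sketched on toy records. The record schema below is an illustrative assumption.

```python
import heapq

# Toy records an LLM-driven filter might extract from retrieved chunks:
# (paper_id, year, citations). Aggregation is exact and symbolic.
records = [
    ("p1", 2023, 410), ("p2", 2023, 950), ("p3", 2022, 120),
    ("p4", 2023, 330), ("p5", 2023, 75),
]

def count_query(rows, year):
    return sum(1 for _, y, _ in rows if y == year)

def topk_query(rows, year, k):
    pool = [(c, pid) for pid, y, c in rows if y == year]
    return heapq.nlargest(k, pool)

print("papers in 2023:", count_query(records, 2023))
print("top-2 cited in 2023:", topk_query(records, 2023, 2))
```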

[35] Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs

Takuma Sato, Seiya Kawano, Koichiro Yoshino

Main category: cs.CL

TL;DR: Using pragmatic theories as prompts improves language models’ ability to understand implied meanings, achieving up to 9.6% performance gains on pragmatic reasoning tasks.

DetailsMotivation: Language models need to accurately interpret implied meanings for effective human communication, but current approaches may not sufficiently leverage pragmatic theories.

Method: Proposed approach presents pragmatic theories (Gricean pragmatics, Relevance Theory) as prompts to guide language models through step-by-step reasoning for interpretation.

Result: Methods achieved up to 9.6% higher scores on pragmatic reasoning tasks compared to baseline (0-shot Chain-of-Thought), with even just mentioning theory names providing 1-3% improvement in larger models.

Conclusion: Providing pragmatic theories as prompts is an effective in-context learning approach for enhancing language models’ pragmatic reasoning capabilities.

Abstract: The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks to understand implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.
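
The prompting recipe is simple to sketch: prepend a theory overview, then request step-by-step interpretation. The overview wording and step structure below are illustrative, not the paper's exact prompt.

```python
THEORY_OVERVIEW = """Gricean pragmatics: speakers follow maxims of quantity,
quality, relation, and manner; apparent violations signal implicatures.
Relevance Theory: utterances are interpreted by maximizing cognitive effect
for minimal processing effort."""

def pragmatic_prompt(utterance: str, question: str) -> str:
    """Build a theory-guided prompt for implied-meaning interpretation."""
    return (
        f"{THEORY_OVERVIEW}\n\n"
        f"Utterance: {utterance}\n"
        f"Question: {question}\n"
        "Step 1: Identify which maxim is apparently violated.\n"
        "Step 2: Infer what the speaker implicates by the violation.\n"
        "Step 3: State the intended interpretation."
    )

print(pragmatic_prompt("Lovely weather we're having.",
                       "What does the speaker mean during a storm?"))
```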

[36] Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages

Mérilin Sousa Silva, Sina Ahmadi

Main category: cs.CL

TL;DR: Pretrained language models perform poorly at identifying loanwords despite contextual information, showing bias toward loanwords over native equivalents.

DetailsMotivation: To investigate whether pretrained language models can differentiate loanwords from native vocabulary, especially in bilingual communities where dominant languages impose lexical items on minority languages.

Method: Evaluated multiple pretrained language models (including large language models) across 10 languages using explicit instructions and contextual information for loanword identification.

Result: Models performed poorly in distinguishing loanwords from native ones, corroborating previous evidence of NLP system bias toward loanwords rather than native equivalents.

Conclusion: The findings have implications for developing NLP tools for minority languages and supporting language preservation in communities facing lexical pressure from dominant languages.

Abstract: Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient’s lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.

[37] Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual

Sukrit Sriratanawilai, Jhayahgrit Thongwat, Romrawin Chumpu, Patomporn Payoungkhamdee, Sarana Nutanong, Peerat Limkonchotiwat

Main category: cs.CL

TL;DR: Knowledge distillation can preserve or improve multilingual retrieval robustness in compressed vision-language models, but shows design-sensitive trade-offs that aggregate accuracy alone doesn’t reveal.

DetailsMotivation: Vision-language models exhibit uneven performance across languages, especially when model size is reduced, and knowledge distillation's behavior in multilingual contexts is underexplored.

Method: Controlled empirical study of five distillation approaches across CLIP and SigLIP2 models, evaluating effects on cross-lingual representation consistency and downstream performance under model compression.

Result: Some distillation configurations preserve or improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability.

Conclusion: Knowledge distillation shows promise for multilingual model compression but requires careful design choices due to trade-offs that aggregate accuracy metrics alone don’t capture.

Abstract: Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While Knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingualism is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
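
The compared approaches are variants of knowledge distillation; the standard logit-distillation objective underlying such formulations is easy to state. The temperature and toy logits below are assumed, and this is one generic formulation, not a specific one from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL-based logit distillation (temperature T is an assumed value)."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

# Toy image-text similarity logits: batch of 4 pairs, 4 candidates each.
teacher = torch.randn(4, 4)
student = torch.randn(4, 4, requires_grad=True)
loss = distillation_loss(student, teacher)
loss.backward()
print("KD loss:", loss.item())
```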

[38] Do LLMs Signal When They’re Right? Evidence from Neuron Agreement

Kang Chen, Yaoning Wang, Kai Xiong, Zhuoka Feng, Wenhe Sun, Haotian Chen, Yixin Cao

Main category: cs.CL

TL;DR: Neuron Agreement Decoding (NAD) is an unsupervised ensemble method that uses internal neuron activations instead of external signals to select best reasoning candidates, enabling early correctness prediction and 99% token reduction.

DetailsMotivation: Current ensemble decoding methods rely on poorly calibrated external signals like token probabilities and self-evaluations, which are limited projections of richer internal neural dynamics.

Method: NAD analyzes internal neuron activations to measure sparsity (fewer unique neurons activated for correct responses) and cross-sample agreement (stronger activation patterns for correct vs divergent incorrect responses), then selects candidates based on these internal signals.

Result: NAD matches majority voting on math/science benchmarks, outperforms Avg@64 on open-ended coding tasks, enables early correctness prediction within 32 tokens, and reduces token usage by 99% with minimal quality loss.

Conclusion: Internal neuron activation signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding, outperforming external signal-based methods while enabling aggressive early stopping.

Abstract: Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self evaluations, and these signals can be poorly calibrated after post training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label free ensemble decoding.
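
NAD's two internal signals, activation sparsity and cross-sample neuron agreement, can be sketched on toy binary activation masks. The equal weighting of the two signals and the mask construction are assumptions for illustration.

```python
import numpy as np

def nad_scores(activation_masks):
    """Score candidates by (1) activating fewer unique neurons and
    (2) agreeing with the population's firing pattern.

    activation_masks: (n_candidates, n_neurons) binary array marking which
    neurons fired during each candidate's generation (toy stand-in)."""
    masks = np.asarray(activation_masks, dtype=float)
    sparsity = 1.0 - masks.mean(axis=1)           # fewer unique neurons = better
    consensus = masks.mean(axis=0)                # population firing pattern
    agreement = masks @ consensus / (masks.sum(axis=1) + 1e-8)
    return sparsity + agreement                   # equal weighting assumed

rng = np.random.default_rng(1)
shared = rng.random(100) < 0.2                    # pattern "correct" samples share
masks = np.stack([shared | (rng.random(100) < 0.02) for _ in range(3)]  # agree
                 + [rng.random(100) < 0.5 for _ in range(2)])           # diverge
print("selected candidate:", int(np.argmax(nad_scores(masks))))
```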

[39] Unravelling the Mechanisms of Manipulating Numbers in Language Models

Michal Štefánik, Timothee Mickus, Marek Kadlčík, Bertram Højer, Michal Spiegel, Raúl Vázquez, Aman Sinha, Josef Kuchař, Philipp Mondorf

Main category: cs.CL

TL;DR: LLMs develop accurate and interchangeable number representations despite producing numerical errors in outputs, enabling universal probes to trace error causes to specific layers.

DetailsMotivation: To resolve the conflict between LLMs' accurate internal number representations and their propensity for numerical errors in outputs.

Method: Explored how LLMs manipulate numbers, quantified accuracy bounds, created universal probes for each LLM, and traced information flow through layers.

Result: Found that different LLMs learn systematic, highly accurate, and universal number representations across hidden states and input contexts, allowing error tracing to specific layers.

Conclusion: Provides fundamental understanding of LLM number manipulation and outlines potential for more accurate probing techniques to refine LLM architectures.

Abstract: Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information – including the causes of output errors – to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs’ architectures.
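
The probing idea, fitting a simple regressor from hidden states to the represented number and applying it per layer to locate where errors arise, can be sketched with a closed-form ridge probe on synthetic "layers". The data and noise levels are fabricated purely to make the mechanics concrete.

```python
import numpy as np

def fit_probe(H, y, lam=1e-2):
    """Closed-form ridge probe from hidden states H to numeric values y."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

rng = np.random.default_rng(0)
numbers = rng.integers(0, 1000, size=200).astype(float)

# Synthetic layers: an early layer encodes the value cleanly, a later one
# corrupts it -- probing each layer localizes where accuracy is lost.
proj = rng.normal(size=(1, 64))
for name, noise in [("layer 5", 0.1), ("layer 20", 5.0)]:
    H = numbers[:, None] @ proj + noise * rng.normal(size=(200, 64))
    W = fit_probe(H, numbers)
    err = np.abs(H @ W - numbers).mean()
    print(f"{name}: mean absolute probe error = {err:.2f}")
```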

[40] Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games

Jingran Zhang, Ning Li, Justin Cui

Main category: cs.CL

TL;DR: Early evaluation of OpenAI’s ChatGPT Atlas web interaction capabilities using browser games shows strong performance in logical reasoning tasks like Sudoku but significant struggles in real-time games requiring precise timing and motor control.

DetailsMotivation: To assess Atlas's performance in dynamic, interactive web environments, as its capabilities in information retrieval are known but real-time interaction abilities remain less explored.

Method: Used browser-based games (T-Rex Runner, Sudoku, Flappy Bird, Stein.world) as test scenarios and employed in-game performance scores as quantitative metrics across different task types.

Result: Atlas performs strongly in logical reasoning tasks (completing Sudoku puzzles faster than human baselines) but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles.

Conclusion: While Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction, highlighting the gap between reasoning capabilities and real-time motor control performance.

Abstract: OpenAI’s ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas’s web interaction capabilities using browser-based games as test scenarios, including Google’s T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.

[41] SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling

Fares Fawzi, Vinitra Swamy, Dominik Glandorf, Tanya Nazaretsky, Tanja Käser

Main category: cs.CL

TL;DR: SCRIBE is a framework for generating pedagogically valid student feedback using small language models (3B-8B parameters) that combine tool-augmented reasoning with self-reflective inference, achieving comparable quality to much larger models while addressing privacy and computational constraints in education.

DetailsMotivation: Address three key challenges in deploying language models for educational feedback: privacy concerns, limited computational resources, and the need for pedagogically valid responses that require small, open-source models running locally with reliable grounding in correct information.

Method: SCRIBE framework combines domain-specific tools with a self-reflective inference pipeline supporting iterative reasoning, tool use, and error recovery. Models are distilled via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data.

Result: 8B-SCRIBE models achieve comparable or superior quality to much larger models in relevance and actionability, and are perceived on par with GPT-4o and Llama-3.3 70B by students in user studies with 108 participants.

Conclusion: SCRIBE demonstrates viability for low-resource, privacy-sensitive educational applications by enabling small models to generate high-quality, pedagogically valid feedback through tool-augmented reasoning and self-reflection.

Abstract: Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.

[42] From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning

Nishit Neema, Srinjoy Mukherjee, Sapan Shah, Gokul Ramakrishnan, Ganesh Venkatesh

Main category: cs.CL

TL;DR: ACER transforms generalist LLMs into domain experts by generating synthetic textbook-style curricula with Bloom’s taxonomy-guided QA pairs, achieving significant performance gains in specialized domains while maintaining general capabilities.

DetailsMotivation: LLMs underperform in specialized domains like economics and psychology that require deep, principled understanding, creating a need for methods to bridge this domain gap without sacrificing general capabilities.

Method: ACER synthesizes comprehensive curricula by generating subject tables of contents and Bloom’s taxonomy-guided QA pairs, then uses continual pretraining with interleaved curriculum scheduling across content and cognitive dimensions.

Result: ACER boosts accuracy by 5 percentage points in challenging domains like microeconomics, achieves 3 percentage point macro-average improvement across target domains, prevents catastrophic forgetting, and enables positive cross-domain knowledge transfer (+0.7 points on non-target domains).

Conclusion: ACER provides a scalable and effective approach for closing critical domain gaps in LLMs, enhancing specialized performance while maintaining general reasoning capabilities and facilitating knowledge transfer.

Abstract: Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom’s taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains like microeconomics, where baselines struggle, ACER boosts accuracy by 5 percentage points. Across all target domains, we observe a consistent macro-average improvement of 3 percentage points. Notably, ACER not only prevents catastrophic forgetting but also facilitates positive cross-domain knowledge transfer, improving performance on non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Our results demonstrate that ACER offers a scalable and effective recipe for closing critical domain gaps in LLMs.

[43] MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data

Mykhailo Poliakov, Nadiya Shvai

Main category: cs.CL

TL;DR: MisSynth pipeline uses RAG-generated synthetic fallacy data to fine-tune LLMs, achieving over 35% F1-score improvement for detecting health-related misinformation.

DetailsMotivation: Health misinformation is prevalent and harmful, but difficult to identify when claims distort scientific findings. Limited annotated data makes detection challenging.

Method: Proposed MisSynth pipeline applies retrieval-augmented generation (RAG) to create synthetic fallacy samples, then fine-tunes LLMs using this synthetic data.

Result: Fine-tuned models show substantial accuracy gains: LLaMA 3.1 8B achieved an absolute F1-score improvement of over 35 points on the MISSCI test split compared to its vanilla baseline.

Conclusion: Synthetic fallacy data augmentation significantly enhances zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources.

Abstract: Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM model. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available on https://github.com/mxpoliakov/MisSynth.
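
The RAG step, retrieving the most relevant passages for a claim and prompting for a labeled synthetic fallacy sample, can be sketched with a stand-in retriever. TF-IDF cosine substitutes for the paper's retrieval, and the `generate` call at the end is a hypothetical LLM interface.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "The study reports a correlation between coffee intake and alertness.",
    "Sample size was 40 participants over two weeks.",
    "Authors caution that causal claims are not supported.",
]
claim = "Coffee cures fatigue, as proven by science."

# Retrieve top-k supporting context (TF-IDF cosine as a stand-in retriever).
vec = TfidfVectorizer().fit(passages + [claim])
sims = cosine_similarity(vec.transform([claim]), vec.transform(passages))[0]
top_k = [passages[i] for i in sims.argsort()[::-1][:2]]

# Prompt for synthesizing a labeled fallacy sample for fine-tuning data.
prompt = (
    "Context:\n- " + "\n- ".join(top_k) + "\n\n"
    f"Claim: {claim}\n"
    "Write a new claim that commits the fallacy 'causal oversimplification' "
    "about this context, and label it."
)
print(prompt)  # synthetic_sample = generate(prompt)  # hypothetical LLM call
```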

[44] On the Role of Context for Discourse Relation Classification in Scientific Writing

Stephen Wan, Wei Liu, Michael Strube

Main category: cs.CL

TL;DR: This paper investigates using pretrained language models and large language models for discourse relation classification in scientific publications, examining how context from discourse structure helps with this task.

DetailsMotivation: With increasing use of generative AI in science workflows, the authors want to use discourse-level information to find supporting evidence for AI-generated scientific claims, starting with discourse structure analysis.

Method: Used pretrained language models (PLMs) and large language models (LLMs) for discourse relation classification (DRC) in scientific publications, examining the role of context defined by discourse structure.

Result: Experiments showed that context, as defined by discourse structure, is generally helpful for DRC. The analysis also identified which scientific discourse relation types benefit most from context.

Conclusion: This preliminary investigation demonstrates the potential of using PLMs and LLMs for discourse relation classification in scientific writing, with context playing a beneficial role in improving classification performance.

Abstract: With the increasing use of generative Artificial Intelligence (AI) methods to support science workflows, we are interested in the use of discourse-level information to find supporting evidence for AI generated scientific claims. A first step towards this objective is to examine the task of inferring discourse structure in scientific writing. In this work, we present a preliminary investigation of pretrained language model (PLM) and Large Language Model (LLM) approaches for Discourse Relation Classification (DRC), focusing on scientific publications, an under-studied genre for this task. We examine how context can help with the DRC task, with our experiments showing that context, as defined by discourse structure, is generally helpful. We also present an analysis of which scientific discourse relation types might benefit most from context.

[45] OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education

Min Zhang, Hao Chen, Hao Chen, Wenqi Zhang, Didi Zhu, Xin Lin, Bo Jiang, Aimin Zhou, Fei Wu, Kun Kuang

Main category: cs.CL

TL;DR: OmniEduBench is a comprehensive Chinese educational benchmark with 24.6K question-answer pairs covering both knowledge and cultivation dimensions across 61 subjects, revealing significant performance gaps between current LLMs and human intelligence in educational applications.

DetailsMotivation: Existing LLM benchmarks focus primarily on knowledge evaluation while neglecting cultivation capabilities essential for real-world education, and lack diversity - particularly in the Chinese context.

Method: Developed OmniEduBench with 24,602 high-quality question-answer pairs divided into knowledge (18,121) and cultivation (6,481) dimensions, each with 6 categories covering 61 subjects and 11 question formats.

Result: Extensive experiments on 11 LLMs show that only Gemini-2.5 Pro surpassed 60% accuracy in the knowledge dimension, while the best model in the cultivation dimension (QWQ) still trailed human performance by nearly 30%.

Conclusion: There is substantial room for improvement in LLMs’ educational capabilities, highlighting the challenges of applying LLMs in education and the need for more comprehensive evaluation benchmarks.

Abstract: With the rapid development of large language models (LLMs), various LLM-based works have been widely applied in educational fields. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension, largely neglecting the evaluation of cultivation capabilities that are essential for real-world educational scenarios. Additionally, current benchmarks are often limited to a single subject or question type, lacking sufficient diversity. This issue is particularly prominent within the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension, which contain 18.121K and 6.481K entries, respectively. Each dimension is further subdivided into 6 fine-grained categories, covering a total of 61 different subjects (41 in the knowledge and 20 in the cultivation). Furthermore, the dataset features a rich variety of question formats, including 11 common exam question types, providing a solid foundation for comprehensively evaluating LLMs’ capabilities in education. Extensive experiments on 11 mainstream open-source and closed-source LLMs reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro surpassed 60% accuracy, while in the cultivation dimension, the best-performing model, QWQ, still trailed human intelligence by nearly 30%. These results highlight the substantial room for improvement and underscore the challenges of applying LLMs in education.

[46] 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models

Zeliang Zong, Kai Zhang, Zheyang Li, Wenming Tan, Ye Ren, Yiyan Zhai, Jilin Hu

Main category: cs.CL

TL;DR: SSLC combines sparse optimization and low-rank approximation to compress LLMs without training, achieving 50% compression on Qwen2.5 with no performance loss and 1.63× speedup.

DetailsMotivation: LLMs face bandwidth and computational constraints despite strong language capabilities. While pruning and low-rank methods work individually, their combined potential for LLM compression remains unexplored.

Method: Formulates low-rank approximation and sparse optimization as a unified problem solved through iterative optimization, leveraging low-rank for structural compression and sparsity for weight elimination.

Result: SSLC consistently outperforms standalone methods on LLaMA and Qwen2.5 models (7B-70B), achieving state-of-the-art compression with 50% size reduction on Qwen2.5 and no performance drop.

Conclusion: SSLC provides a practical solution for efficient LLM deployment by synergistically combining sparse and low-rank compression, delivering significant speedup and compression without additional training.

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce Synergistic Sparse and Low-Rank Compression (SSLC) methods for LLMs, which leverage the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate the low-rank approximation and sparse optimization as a unified problem and solve it by an iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models (7B-70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-art results. Notably, SSLC compresses Qwen2.5 by 50% with no performance drop and achieves at least 1.63× speedup, offering a practical solution for efficient LLM deployment.
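
The underlying decomposition, approximating a weight matrix as a low-rank part plus a sparse part, can be sketched by alternating a truncated SVD with magnitude pruning. The rank, sparsity level, and iteration count are assumed; this illustrates the joint formulation, not the paper's exact algorithm.

```python
import numpy as np

def sparse_plus_lowrank(W, rank=8, sparsity=0.5, iters=10):
    """Approximate W ~= L + S with L low-rank and S sparse by alternating
    a truncated SVD with magnitude-based pruning of the residual."""
    S = np.zeros_like(W)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(W - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]      # low-rank structure
        R = W - L
        thresh = np.quantile(np.abs(R), sparsity)     # keep largest residuals
        S = np.where(np.abs(R) >= thresh, R, 0.0)     # sparse correction
    return L, S

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)) @ np.diag(np.linspace(1, 0.01, 64))
L, S = sparse_plus_lowrank(W)
err = np.linalg.norm(W - L - S) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}, nnz(S): {(S != 0).mean():.2f}")
```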

[47] Bayesian Network Fusion of Large Language Models for Sentiment Analysis

Rasoul Amirzadeh, Dhananjay Thiruvady, Fatemeh Shiri

Main category: cs.CL

TL;DR: BNLF framework integrates predictions from three LLMs (FinBERT, RoBERTa, BERTweet) using Bayesian networks for sentiment analysis, achieving ~6% accuracy improvement over baseline models.

Motivation: Address challenges with domain-specific LLMs including lack of transparency, high fine-tuning costs, prompt engineering requirements, inconsistent results across domains, and high environmental impact.

Method: Bayesian network LLM fusion (BNLF) framework that performs late fusion by modeling sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network.

Result: Evaluated across three human-annotated financial corpora, BNLF demonstrates consistent gains of about six percent in accuracy over baseline LLMs, showing robustness to dataset variability.

Conclusion: Probabilistic fusion through Bayesian networks is effective for interpretable sentiment classification, providing consistent performance improvements across diverse datasets.

Abstract: Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and impose significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs (FinBERT, RoBERTa, and BERTweet) through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
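
A minimal sketch of the late-fusion idea, assuming hypothetical per-model confusion matrices estimated on held-out data; naive Bayes fusion is the simplest special case of the Bayesian-network combination the abstract describes.

```python
import numpy as np

labels = ["negative", "neutral", "positive"]
prior = np.array([1/3, 1/3, 1/3])  # P(true sentiment)

# Hypothetical confusion matrices P(model predicts j | truth i), one per LLM,
# which a real system would estimate on validation data.
conf = {
    "finbert":  np.array([[.80, .10, .10], [.15, .70, .15], [.10, .10, .80]]),
    "roberta":  np.array([[.75, .15, .10], [.20, .60, .20], [.10, .15, .75]]),
    "bertweet": np.array([[.70, .20, .10], [.20, .60, .20], [.15, .15, .70]]),
}

def fuse(predictions):
    """Posterior over true sentiment given each model's hard label."""
    post = prior.copy()
    for model, pred in predictions.items():
        post *= conf[model][:, labels.index(pred)]
    return post / post.sum()

print(fuse({"finbert": "positive", "roberta": "neutral", "bertweet": "positive"}))
```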

[48] A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool

Adam E. Flanders, Yifan Peng, Luciano Prevedello, Robyn Ball, Errol Colak, Prahlad Menon, George Shih, Hui-Ming Lin, Paras Lakhani

Main category: cs.CL

TL;DR: An ensemble of multiple LLM agents provides more reliable assessment of AI triage tools than single LLMs, with open-source models matching or outperforming GPT-4o.

Motivation: To determine if an ensemble of multiple LLM agents could provide more reliable assessment of pixel-based AI triage tools than a single LLM alone.

Method: Analyzed 29,766 non-contrast CT head exams using a commercial ICH AI detection tool. Radiology reports were evaluated by an ensemble of eight open-source LLMs and GPT-4o using multi-shot prompts. Manual review of 1,726 examples was performed.

Result: Llama3.3:70b achieved highest performance (AUC=0.78, AP=0.75, F1=0.81). LLM ensembles (Full-9, Top-3, Consensus) showed better performance than single GPT-4o (MCC 0.571 vs 0.522) with no significant differences between ensemble types.

Conclusion: Ensemble of medium to large open-source LLMs provides more consistent and reliable ground truth evaluation for clinical AI triage tools compared to single LLMs.

Abstract: Purpose: The purpose of this study was to determine if an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open-source LLM models and a HIPAA compliant internal version of GPT-4o using a single multi-shot prompt that assessed for presence of ICH. 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CT exam-report pairs. The highest AUC performance was achieved with llama3.3:70b and GPT-4o (AUC= 0.78). The average precision was highest for Llama3.3:70b and GPT-4o (AP=0.75 & 0.76). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI), the ideal combinations of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT4o 0.522 (0.500-0.543). No statistically significant differences were observed between Top-3, Full-9, and Consensus (p > 0.05). Conclusion: An ensemble of medium-to-large-sized open-source LLMs provides a more consistent and reliable method to derive a ground truth retrospective evaluation of a clinical AI triage tool than a single LLM alone.
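
The abstract does not spell out the consensus rule, so the sketch below assumes simple majority voting over synthetic binary ICH labels and compares ensemble versus single-agent MCC with scikit-learn.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=1000)                       # synthetic ground truth
# Synthetic agent labels: each of 9 LLMs agrees with the truth ~80% of the time.
votes = np.where(rng.random((1000, 9)) < 0.8, truth[:, None], 1 - truth[:, None])

consensus = (votes.mean(axis=1) >= 0.5).astype(int)         # majority vote
print("single-agent MCC:", matthews_corrcoef(truth, votes[:, 0]))
print("ensemble MCC:    ", matthews_corrcoef(truth, consensus))
```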

[49] Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs

Dipak Meher, Carlotta Domeniconi

Main category: cs.CL

TL;DR: CORE-KG framework reduces node duplication and noise in legal knowledge graphs by integrating coreference resolution and structured prompts, with ablation study showing structured prompts are more critical for noise reduction while coreference resolution better handles duplication.

Motivation: Human smuggling networks are adaptive and hard to analyze using unstructured legal documents. Existing LLM-based approaches create noisy, fragmented graphs with duplicate nodes due to lack of guided extraction and coreference resolution.

Method: CORE-KG framework integrates type-aware coreference module and domain-guided structured prompts. Systematic ablation study quantifies contributions of each component.

Result: Removing coreference resolution increases node duplication by 28.32% and noisy nodes by 4.32%. Removing structured prompts increases node duplication by 4.34% and noisy nodes by 73.33%.

Conclusion: Structured prompts are more effective for reducing noise while coreference resolution better addresses duplication. Findings provide empirical insights for designing robust LLM-based pipelines for legal text extraction.

Abstract: Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
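
For intuition on the ablation numbers, here is one hypothetical way to define the node-duplication rate; both the canonicalization function and the metric's exact definition are assumptions, not the paper's.

```python
def duplication_rate(nodes, canonical):
    """Fraction of extracted KG nodes whose canonical form was already seen."""
    seen, dup = set(), 0
    for n in nodes:
        key = canonical(n)
        dup += key in seen
        seen.add(key)
    return dup / len(nodes)

# Without coreference resolution, "John Doe" and "J. Doe" become duplicate nodes.
print(duplication_rate(["John Doe", "J. Doe", "Maria Perez"],
                       canonical=lambda s: s.split()[-1].lower()))  # 1/3
```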

[50] Hebrew Diacritics Restoration using Visual Representation

Yair Elboher, Yuval Pinter

Main category: cs.CL

TL;DR: DIVRIT is a novel Hebrew diacritization system that frames the task as zero-shot classification using a Hebrew Visual Language Model that processes undiacritized text as images.

Motivation: Hebrew diacritics restoration is crucial for accurate pronunciation and meaning disambiguation, but the language has high ambiguity when unvocalized. Recent ML approaches have advanced performance, but there is a need for more effective methods.

Method: Frames diacritization as zero-shot classification at word level, selecting diacritization patterns from dynamically generated candidates conditioned on context. Uses Hebrew Visual Language Model that processes undiacritized text as images to embed diacritic information directly in vector representations.

Result: In oracle settings with correct diacritized forms among candidates, DIVRIT achieves high accuracy. Architectural enhancements and optimized training yield significant improvements in generalization capabilities.

Conclusion: The findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization without relying on complex linguistic analysis.

Abstract: Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language’s high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input’s vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an “oracle” setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system’s overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.

[51] The Structure of Relation Decoding Linear Operators in Large Language Models

Miranda Anna Christ, Adrián Csiszárik, Gergely Becsó, Dániel Varga

Main category: cs.CL

TL;DR: Linear relational decoders in transformers are highly compressible and operate on coarse-grained semantic properties rather than distinct relations, explaining their redundancy and limited generalization.

Motivation: To understand the organization and structure of linear operators that decode relational facts in transformer language models, extending single-relation findings to collections of relations.

Method: Systematically charted relation decoder organization, compressed them using order-3 tensor networks, and developed a cross-evaluation protocol to test decoders on subjects of different relations.

Result: Linear decoders are highly compressible without accuracy loss and extract recurring semantic properties rather than distinct relations, explaining their redundancy.

Conclusion: Linear relational decoding in transformers is primarily property-based rather than relation-specific, clarifying compressibility and generalization limitations.

Abstract: This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies the operators’ compressibility and explains why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
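
A small numpy sketch of the two analyses: the compressibility probe stacks the relation decoders into an order-3 tensor and checks how much energy a few components of the relation-mode unfolding capture, and the cross-evaluation applies each decoder to subjects of every other relation. Random matrices stand in for the actual learned decoders.

```python
import numpy as np

d, R = 64, 20
decoders = np.stack([np.random.randn(d, d) for _ in range(R)])   # (R, d, d), stand-ins

# Compressibility probe: SVD of the relation-mode unfolding.
sig = np.linalg.svd(decoders.reshape(R, d * d), compute_uv=False)
print("energy in top-3 components:", (sig[:3] ** 2).sum() / (sig ** 2).sum())

# Cross-evaluation protocol: decoder of relation r applied to subjects of relation rp.
def cross_eval(decoders, subjects):
    return {(r, rp): subjects[rp] @ decoders[r].T
            for r in range(len(decoders)) for rp in subjects}

subjects = {0: np.random.randn(5, d), 1: np.random.randn(5, d)}  # subject embeddings
print(len(cross_eval(decoders, subjects)), "(relation, subject-set) prediction blocks")
```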

[52] InfoFlow: Reinforcing Search Agent Via Reward Density Optimization

Kun Luo, Hongjin Qian, Zheng Liu, Ziyi Xia, Shitao Xiao, Siqi Bao, Jun Zhao, Kang Liu

Main category: cs.CL

TL;DR: InfoFlow addresses low reward density in RLVR by decomposing tasks, providing failure-guided hints, and using dual-agent architecture to improve reward per exploration cost.

Motivation: RLVR faces challenges with low reward density in deep search scenarios where agents expend high exploratory costs for infrequent rewards.

Method: InfoFlow uses subproblem decomposition for process rewards, failure-guided hints for corrective guidance, and dual-agent refinement to compress search history and reduce exploration burden.

Result: InfoFlow significantly outperforms baselines on agentic search benchmarks, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.

Conclusion: The InfoFlow framework effectively addresses reward density optimization in RLVR, making deep search more efficient and accessible with lightweight models.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. This paper introduces InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher’s perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
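
The reward-density objective itself is easy to state; the toy numbers below (illustrative only) show how subproblem-level process rewards densify the learning signal relative to a single sparse outcome reward.

```python
def reward_density(rewards, costs):
    """Reward obtained per unit of exploration cost (e.g., tokens or tool calls)."""
    return sum(rewards) / max(sum(costs), 1)

# Sparse outcome reward: one terminal reward after long, costly trajectories.
print(reward_density([0, 0, 0, 1.0], [900, 850, 950, 800]))
# Denser process rewards: each solved subproblem earns partial credit at lower cost.
print(reward_density([0.25, 0.25, 0.25, 1.0], [300, 280, 310, 250]))
```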

[53] Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models

Yinrong Hong, Zhiquan Tan, Kai Hu

Main category: cs.CL

TL;DR: CAST is a dynamic tree decoding approach that optimizes speculative decoding by considering GPU configurations and batch sizes, achieving up to 5.2x speedup over conventional methods.

Motivation: Address LLM inference latency challenges from autoregressive design and large model size, while existing speculative decoding approaches neglect system variables like GPU devices and batch sizes.

Method: Dynamic tree decoding approach that incorporates inference costs (GPU configurations, batch sizes) to dynamically refine tree structure during speculative decoding.

Result: Achieves speeds up to 5.2x faster than conventional decoding methods and outperforms state-of-the-art techniques by 5-20% across six diverse tasks using six distinct LLMs.

Conclusion: CAST effectively reduces LLM inference latency by dynamically optimizing tree structures based on system variables, demonstrating significant performance improvements over existing methods.

Abstract: Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.
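
A hedged sketch of cost-aware draft-tree construction: grow the tree greedily by expected accepted tokens, and stop once the marginal gain per unit verification cost (which rises with batch size or on slower GPUs) drops below a threshold. The fixed branching distribution and the threshold are assumptions; CAST's actual policy is defined in the paper.

```python
import heapq

def build_draft_tree(root_probs, cost_per_node, min_gain_per_cost=0.05):
    """Grow a speculative-decoding draft tree greedily, stopping once the best
    candidate's expected accepted-token gain no longer justifies its cost."""
    heap = [(-p, p) for p in root_probs]
    heapq.heapify(heap)
    expected_accepted, nodes = 0.0, 0
    while heap:
        _, path_p = heapq.heappop(heap)
        if path_p / cost_per_node < min_gain_per_cost:
            break  # marginal benefit below the cost-aware threshold
        expected_accepted += path_p
        nodes += 1
        # Fixed branching stands in for querying the draft model's next-token probs.
        for child_p in (0.6, 0.3, 0.1):
            heapq.heappush(heap, (-(path_p * child_p), path_p * child_p))
    return expected_accepted, nodes

print(build_draft_tree([0.7, 0.2, 0.1], cost_per_node=1.0))  # cheap verification: big tree
print(build_draft_tree([0.7, 0.2, 0.1], cost_per_node=4.0))  # large batch: smaller tree
```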

[54] SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar

Main category: cs.CL

TL;DR: SlideAgent is an agentic framework that improves multi-page visual document understanding through specialized agents working at global, page, and element levels, achieving significant performance gains over existing models.

Motivation: Current LLMs struggle with complex multi-page visual documents that require fine-grained reasoning over elements and pages, particularly for documents like slide decks that use layout, colors, icons, and cross-references to convey information.

Method: Uses specialized agents organized into three reasoning levels (global, page, element) to construct structured, query-agnostic representations. During inference, selectively activates agents for multi-level reasoning and integrates their outputs.

Result: Achieves significant improvements over both proprietary (+7.9 overall) and open-source models (+9.8 overall) in extensive experiments.

Conclusion: SlideAgent provides an effective framework for understanding multi-modal, multi-page, and multi-layout documents through hierarchical agent-based reasoning.

Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).

[55] Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model

Biao Zhang, Yong Cheng, Siamak Shakeri, Xinyi Wang, Min Ma, Orhan Firat

Main category: cs.CL

TL;DR: This paper revisits encoder-decoder LLMs (RedLLM) enhanced with recent decoder-only recipes, comparing them with decoder-only LLMs (DecLLM) across scales from 150M to 8B parameters, showing RedLLM has compelling scaling properties and strong performance.

Motivation: The rapid transition from encoder-decoder to decoder-only LLM architectures occurred without rigorous comparative analysis from a scaling perspective, potentially overlooking encoder-decoder models' potential.

Method: Enhanced encoder-decoder LLMs (RedLLM) with recent decoder-only recipes, pretrained with prefix language modeling on RedPajama V1 (1.6T tokens), compared against decoder-only LLMs (DecLLM) pretrained with causal LM across 150M to 8B parameter scales, with FLAN instruction tuning.

Result: RedLLM shows compelling scaling properties and surprisingly strong performance. While DecLLM is more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable or better downstream task results with substantially better inference efficiency.

Conclusion: Encoder-decoder LLMs (RedLLM) deserve re-examination as they show strong potential for developing powerful and efficient LLMs, with comparable performance to decoder-only models and better inference efficiency after instruction tuning.

Abstract: Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, comes without a rigorous comparative analysis, especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from ~150M to ~8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings could inspire more efforts on re-examining RedLLM, unlocking its potential for developing powerful and efficient LLMs.
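
The core pretraining difference between the two families is the attention mask: prefix LM lets the prefix attend bidirectionally while the continuation stays causal. A minimal torch sketch:

```python
import torch

def causal_mask(T):
    return torch.ones(T, T).tril().bool()

def prefix_lm_mask(T, prefix_len):
    """Prefix tokens attend bidirectionally (encoder-like); the continuation
    attends causally, which is the pattern prefix-LM pretraining uses."""
    mask = causal_mask(T)
    mask[:prefix_len, :prefix_len] = True  # full attention inside the prefix
    return mask

print(prefix_lm_mask(6, prefix_len=3).int())
```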

[56] Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models

Mingchen Tu, Zhiqiang Liu, Juan Li, Liangyurui Liu, Junjie Wang, Lei Liang, Wen Zhang

Main category: cs.CL

TL;DR: Evontree is a framework that uses ontology rules to extract, validate, and enhance domain knowledge in LLMs without extensive external datasets, achieving up to 3.7% accuracy improvement in medical QA benchmarks.

Motivation: LLMs struggle in data-sensitive domains like healthcare due to lack of high-quality domain-specific training data, while domain experts have created valuable ontology rules that formalize concept relationships.

Method: Extracts domain ontology from raw LLMs, detects inconsistencies using two core ontology rules, and reinforces refined knowledge via self-distilled fine-tuning.

Result: Consistent outperformance over unmodified models and supervised baselines on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2, achieving up to 3.7% accuracy improvement.

Conclusion: The approach is effective, efficient, and robust for low-resource domain adaptation of LLMs, demonstrating the value of leveraging existing ontology rules for knowledge enhancement.

Abstract: Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of a high-quality, domain-specific training corpus hinders LLMs’ adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.
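
The abstract does not name its two core ontology rules, so the sketch below assumes transitivity and asymmetry of the is-a relation as illustrative consistency checks over an extracted hierarchy.

```python
def violations(subclass_pairs):
    """Flag is-a pairs that break two simple ontology rules: transitivity
    (a missing implied edge) and asymmetry (a cycle of length two)."""
    rel = set(subclass_pairs)
    trans = {(a, d) for a, b in rel for c, d in rel if b == c and (a, d) not in rel}
    asym = {(a, b) for a, b in rel if (b, a) in rel}
    return trans, asym

# "myocarditis is-a carditis" and "carditis is-a heart disease" imply a
# missing "myocarditis is-a heart disease" edge.
print(violations({("myocarditis", "carditis"), ("carditis", "heart disease")}))
```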

[57] Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du

Main category: cs.CL

TL;DR: Kimi Linear is a hybrid linear attention architecture that outperforms full attention across various scenarios while reducing KV cache usage by up to 75% and achieving 6x decoding throughput for 1M context.

Motivation: To develop a linear attention architecture that can outperform full attention under fair comparisons while maintaining efficiency, especially for long-context scenarios.

Method: Uses Kimi Delta Attention (KDA) with finer-grained gating mechanism and bespoke chunkwise algorithm with specialized DPLR transition matrices. Pretrains a 3B activated parameter model with hybrid KDA and Multi-Head Latent Attention.

Result: Outperforms full MLA with significant margin across all tasks, reduces KV cache usage by up to 75%, achieves 6x decoding throughput for 1M context, and serves as drop-in replacement for full attention with superior performance and efficiency.

Conclusion: Kimi Linear demonstrates that linear attention can outperform full attention while providing substantial efficiency gains, making it a viable replacement for traditional attention architectures.

Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios – including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
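
For intuition, here is a slow recurrent sketch of a gated delta-rule state update with per-channel gates, the kind of finer-grained gating KDA adds to Gated DeltaNet. The exact gate placement is an assumption, and the chunkwise DPLR kernel is what makes the real module hardware-efficient; this loop is purely illustrative.

```python
import numpy as np

def gated_delta_rule(q, k, v, beta, g):
    """Recurrent sketch: S_t = diag(g_t) S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T,
    with output o_t = S_t q_t. Per-channel gates g_t are the assumed fine-grained part."""
    T, dk = k.shape
    dv = v.shape[1]
    S = np.zeros((dv, dk))
    outs = []
    for t in range(T):
        kt, vt = k[t], v[t]
        S = (g[t][:, None] * S) @ (np.eye(dk) - beta[t] * np.outer(kt, kt)) \
            + beta[t] * np.outer(vt, kt)
        outs.append(S @ q[t])
    return np.stack(outs)

T, dk, dv = 8, 16, 16
out = gated_delta_rule(np.random.randn(T, dk), np.random.randn(T, dk),
                       np.random.randn(T, dv),
                       beta=np.full(T, 0.5), g=np.random.uniform(0.9, 1.0, (T, dv)))
print(out.shape)  # (8, 16)
```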

[58] The End of Manual Decoding: Towards Truly End-to-End Language Models

Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang

Main category: cs.CL

TL;DR: AutoDeco enables truly end-to-end LLM generation by learning to control decoding parameters (temperature, top-p) dynamically per token, eliminating manual hyperparameter tuning.

Motivation: Current LLMs require laborious manual tuning of non-differentiable decoding hyperparameters like temperature and top-p, making 'end-to-end' generation a misnomer.

Method: Augment standard transformer with lightweight heads that dynamically predict context-specific temperature and top-p values alongside next-token logits in a single forward pass.

Result: Outperforms default decoding strategies and achieves performance comparable to oracle-tuned baseline across eight benchmarks; demonstrates emergent instruction-based decoding control.

Conclusion: AutoDeco transforms decoding into parametric token-level process, enabling self-regulated sampling and opening new paradigm for steerable, interactive LLM decoding.

Abstract: The “end-to-end” label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly “end-to-end” generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from “hacking the test set”, a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., “generate with low randomness”) and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
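
A minimal sketch of the decoding-side idea: two small heads predict a per-token temperature and top-p from the hidden state, and sampling uses them immediately. The head shapes and the (0.1, 2.0) temperature range are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class AutoDecoHead(nn.Module):
    """Lightweight heads predicting temperature and top-p alongside the logits."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.temp_head = nn.Linear(d_model, 1)
        self.topp_head = nn.Linear(d_model, 1)

    def forward(self, h):  # h: (batch, d_model) hidden state at the current step
        logits = self.lm_head(h)
        temperature = 0.1 + 1.9 * torch.sigmoid(self.temp_head(h))  # ~(0.1, 2.0)
        top_p = torch.sigmoid(self.topp_head(h))                    # (0, 1)
        return logits, temperature, top_p

def sample(logits, temperature, top_p):
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_p, idx = probs.sort(descending=True)
    keep = sorted_p.cumsum(-1) - sorted_p < top_p   # nucleus: smallest covering set
    sorted_p = torch.where(keep, sorted_p, torch.zeros_like(sorted_p))
    sorted_p /= sorted_p.sum(-1, keepdim=True)
    return idx.gather(-1, torch.multinomial(sorted_p, 1))

head = AutoDecoHead(d_model=64, vocab_size=100)
logits, t, p = head(torch.randn(2, 64))
print(sample(logits, t, p))
```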

[59] Value Drifts: Tracing Value Alignment During LLM Post-Training

Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Vered Shwartz, Siva Reddy

Main category: cs.CL

TL;DR: This paper investigates when and how value alignment emerges during LLM post-training, finding that supervised fine-tuning (SFT) establishes core values while preference optimization rarely re-aligns them, with different algorithms producing different outcomes even with identical preference data.

Motivation: As LLMs become more influential, understanding how they align with human values is crucial. Previous work focused on evaluating fully trained models, missing insights into the training dynamics of value learning.

Method: Analyzed value alignment dynamics during post-training using Llama-3 and Qwen-3 models of various sizes. Used supervised fine-tuning (SFT) and preference optimization datasets/algorithms, plus a synthetic preference dataset for controlled value manipulation.

Result: SFT phase generally establishes a model’s core values, and subsequent preference optimization rarely re-aligns these values. Different preference optimization algorithms lead to different value alignment outcomes even when preference data is identical.

Conclusion: The findings provide actionable insights for data curation and algorithm selection to improve model alignment with human values, emphasizing that value alignment is primarily established during SFT rather than preference optimization phases.

Abstract: As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model’s post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model’s values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.
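
One simple way to operationalize "magnitude and time of value drifts" is the total-variation distance between consecutive checkpoints' value-response distributions; the paper's exact measure may differ.

```python
import numpy as np

def value_drift(checkpoint_dists):
    """Return the size and step of the largest drift between consecutive
    checkpoints' value-response distributions (total-variation distance)."""
    drifts = [0.5 * np.abs(p - q).sum()
              for p, q in zip(checkpoint_dists, checkpoint_dists[1:])]
    return max(drifts), int(np.argmax(drifts)) + 1

dists = [np.array([.5, .3, .2]), np.array([.2, .5, .3]), np.array([.2, .5, .3])]
print(value_drift(dists))  # largest drift at step 1 (e.g., during SFT)
```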

[60] AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, Shuang Zhou

Main category: cs.CL

TL;DR: AMO-Bench is a new mathematical reasoning benchmark with Olympiad-level difficulty, featuring 50 original problems validated by experts to assess advanced LLM capabilities beyond existing saturated benchmarks.

Motivation: Existing math benchmarks are becoming ineffective due to performance saturation in top LLMs, requiring more challenging problems to properly evaluate advanced mathematical reasoning capabilities.

Method: Created 50 human-crafted problems cross-validated by experts to meet IMO difficulty standards, using entirely original problems to prevent data memorization, with answer-only format for automatic grading.

Result: Even the best-performing LLM achieved only 52.4% accuracy, with most models scoring below 40%, but showed promising scaling trends with increased test-time compute.

Conclusion: AMO-Bench reveals significant room for improvement in LLMs’ mathematical reasoning and provides a challenging benchmark for advancing reasoning abilities in language models.

Abstract: We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/
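
The answer-only format is what makes grading automatic; a minimal normalize-and-compare grader looks like the sketch below (the benchmark's actual grader is presumably more robust).

```python
def grade(pred, gold):
    """Answer-only automatic grading: normalize and compare final answers."""
    norm = lambda s: s.strip().rstrip(".").replace(" ", "").lower()
    return norm(pred) == norm(gold)

print(grade("  3/4.", "3/4"))  # True
```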

[61] Gistify! Codebase-Level Understanding via Runtime Execution

Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane, Isadora White, Elias Stengel-Eskin, Mohit Bansal, Zhengyan Shi, Alessandro Sordoni, Marc-Alexandre Côté, Xingdi Yuan, Lucas Caccia

Main category: cs.CL

TL;DR: Gistify is a task where coding LLMs must create minimal, self-contained files that reproduce specific codebase functionalities, requiring structural understanding, execution flow modeling, and large code patch generation.

Motivation: As coding agents are increasingly deployed in large codebases, there is a need for automated, challenging codebase-level evaluation to test their capabilities.

Method: Propose Gistify task where coding LLMs are given full codebase access and a specific entrypoint, and must generate minimal files that replicate the same output while containing only essential components.

Result: Current state-of-the-art models struggle to reliably solve Gistify tasks, especially those with long execution traces.

Conclusion: Gistify serves as a challenging evaluation framework that reveals limitations in current coding LLMs’ abilities to understand codebase structure, model execution flow, and generate large code patches.

Abstract: As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
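
Success on Gistify is checkable mechanically: run the entrypoint under the full codebase and under the generated single file, then compare outputs. A minimal harness, with hypothetical entrypoint arguments:

```python
import subprocess

def check_gist(entrypoint_args, gist_file="gist.py", timeout=120):
    """Compare the output of the full codebase against the single-file gist."""
    full = subprocess.run(["python", *entrypoint_args],
                          capture_output=True, text=True, timeout=timeout)
    gist = subprocess.run(["python", gist_file],
                          capture_output=True, text=True, timeout=timeout)
    return full.stdout == gist.stdout

# ok = check_gist(["-m", "mypkg.cli", "--config", "small"])  # hypothetical entrypoint
```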

[62] Are LLMs Rigorous Logical Reasoners? Empowering Natural Language Proof Generation by Stepwise Decoding with Contrastive Learning

Ying Su, Mingwen Liu, Zhijiang Guo

Main category: cs.CL

TL;DR: Proposes stepwise decoding with contrastive learning to address LLM decoding errors in logical reasoning, improving proof planning quality while reducing computational costs.

Motivation: Current LLM-based proof planning systems have evolved to multi-stage approaches but introduce increased search efforts and computational costs, while the generative process itself remains underexplored.

Method: Stepwise decoding approach augmented by contrastive learning, fine-tuning language models using both vanilla and enhanced hard negatives to mitigate common decoding errors.

Result: Empirical results demonstrate the effectiveness of the proposed strategy in improving logical reasoning performance.

Conclusion: Even larger LLMs still struggle to generate rigorous logical chains, highlighting the need for improved decoding methods in logical reasoning tasks.

Abstract: Logical reasoning is a pivotal component in the field of artificial intelligence. Proof planning, particularly in contexts requiring the validation of explanation accuracy, continues to present challenges. The recent advancement of large language models (LLMs) has led to significant progress in natural language proof planning, evolving from one-stage generators to more complex three-stage systems that include additional searchers or verifiers. While these assisted methods improve the quality of generated results, they also introduce increased search efforts and computational costs. Furthermore, the generative process itself remains underexplored. In this study, we propose a stepwise decoding approach augmented by contrastive learning to address two common errors encountered during the LLM generator’s decoding process. We fine-tune the language model using both vanilla and enhanced hard negatives to mitigate these decoding errors. Empirical results demonstrate the effectiveness of our strategy. Additionally, our further analysis reveals that even larger LLMs still struggle to generate rigorous logical chains.
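
One common contrastive formulation for this setting is a margin loss that pushes the gold step's sequence log-probability above each hard negative's; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_step_loss(pos_logp, neg_logps, margin=1.0):
    """Hinge loss: the gold proof step should outscore every hard negative
    by at least `margin` in log-likelihood."""
    losses = F.relu(margin - (pos_logp.unsqueeze(-1) - neg_logps))
    return losses.mean()

pos = torch.tensor([-2.1, -1.7])          # sequence log-prob of the gold step
negs = torch.tensor([[-2.0, -3.5],        # vanilla + enhanced hard negatives
                     [-1.9, -2.8]])
print(contrastive_step_loss(pos, negs))
```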

[63] The LSCD Benchmark: a Testbed for Diachronic Word Meaning Tasks

Dominik Schlechtweg, Sachin Yadav, Nikolay Arefyev

Main category: cs.CL

TL;DR: A benchmark repository for Lexical Semantic Change Detection (LSCD) that standardizes evaluation across modular components (WiC, WSI, LSCD) to improve reproducibility and model optimization.

Motivation: Current LSCD approaches suffer from heterogeneity in datasets, preprocessing, and evaluation metrics, making model comparison and reproduction difficult.

Method: Developed a standardized benchmark repository that allows modular evaluation of WiC, WSI, and LSCD components, enabling transparent implementation and free combination of different model components.

Result: The benchmark enables systematic model evaluation and optimization, leading to improvements in state-of-the-art performance through carefully designed experiments.

Conclusion: Standardized benchmarking addresses reproducibility issues in LSCD research and provides new pathways for model optimization through modular component evaluation.

Abstract: Lexical Semantic Change Detection (LSCD) is a complex, lemma-level task, which is usually operationalized based on two subsequently applied usage-level tasks: First, Word-in-Context (WiC) labels are derived for pairs of usages. Then, these labels are represented in a graph on which Word Sense Induction (WSI) is applied to derive sense clusters. Finally, LSCD labels are derived by comparing sense clusters over time. This modularity is reflected in most LSCD datasets and models. It also leads to a large heterogeneity in modeling options and task definitions, which is exacerbated by a variety of dataset versions, preprocessing options and evaluation metrics. This heterogeneity makes it difficult to evaluate models under comparable conditions, to choose optimal model combinations or to reproduce results. Hence, we provide a benchmark repository standardizing LSCD evaluation. Through transparent implementation results become easily reproducible and by standardization different components can be freely combined. The repository reflects the task’s modularity by allowing model evaluation for WiC, WSI and LSCD. This allows for careful evaluation of increasingly complex model components providing new ways of model optimization. We use the implemented benchmark to conduct a number of experiments with recent models and systematically improve the state-of-the-art.
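
The modular pipeline is easy to sketch end-to-end: WiC labels define edges on usages, WSI clusters the resulting graph (connected components here, for simplicity), and LSCD compares sense distributions across time periods.

```python
import networkx as nx
from collections import Counter

def lscd_pipeline(usage_pairs, wic_labels, times):
    """Modular LSCD sketch: WiC labels -> usage graph -> WSI via connected
    components -> sense distributions per period. times[u] is 0 (older
    corpus) or 1 (newer corpus) for usage id u."""
    G = nx.Graph()
    G.add_nodes_from(times)
    G.add_edges_from(p for p, same in zip(usage_pairs, wic_labels) if same)
    dist = [Counter(), Counter()]
    for c, comp in enumerate(nx.connected_components(G)):  # each component = a sense
        for u in comp:
            dist[times[u]][c] += 1
    return dist  # a change score would compare these two sense distributions

times = {0: 0, 1: 0, 2: 1, 3: 1}
print(lscd_pipeline([(0, 1), (1, 2), (2, 3)], [True, False, True], times))
```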

[64] Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking

Jihyun Lee, Solee Im, Wonjun Lee, Gary Geunbae Lee

Main category: cs.CL

TL;DR: A data augmentation method improves DST robustness in spoken dialogues by targeting ASR-induced named entity errors using phonetic similarity and keyword-highlighted prompts.

Motivation: DST accuracy significantly drops in spoken dialogue environments due to named entity errors from ASR systems.

Method: Simple data augmentation method that controls error placement using keyword-highlighted prompts and introduces phonetically similar errors.

Result: Generated sufficient error patterns on keywords, leading to improved accuracy in noised and low-accuracy ASR environments.

Conclusion: The proposed method effectively enhances DST model robustness against ASR-induced named entity errors.

Abstract: Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of the DST model. Our novel method can control the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, our method generated sufficient error patterns on keywords, leading to improved accuracy in noised and low-accuracy ASR environments.
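
A toy sketch of keyword-targeted phonetic error injection; the confusion table here is invented, whereas the paper derives phonetically similar errors with an LLM and controls their placement via keyword-highlighted prompts.

```python
import random

# Toy phonetically-confusable entity variants; a real system would derive these
# from phoneme distances or ASR confusion statistics.
CONFUSIONS = {"cambridge": ["cambrige", "kembridge"],
              "marriott": ["mariot", "marryot"]}

def augment(utterance, keywords):
    """Inject ASR-like errors only on the highlighted keywords (slot values)."""
    tokens = utterance.split()
    out = [random.choice(CONFUSIONS.get(t.lower(), [t])) if t.lower() in keywords else t
           for t in tokens]
    return " ".join(out)

random.seed(0)
print(augment("book the Marriott in Cambridge", keywords={"marriott", "cambridge"}))
```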

[65] Language Model Preference Evaluation with Multiple Weak Evaluators

Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Kaize Ding, Ranjay Krishna

Main category: cs.CL

TL;DR: PGED introduces a multi-evaluator approach using preference graphs to address cyclic preference issues in LLM evaluation, outperforming single-evaluator methods.

Motivation: Existing LLM evaluation methods using a single strong LLM as judge are vulnerable to cyclic preferences (A>B, B>C, C>A), causing contradictory results.

Method: PGED uses multiple model-based evaluators to construct preference graphs, then ensembles and denoises these graphs to ensure acyclic, non-contradictory evaluation results.

Result: Extensive experiments on ten benchmarks show PGED’s superiority in model ranking, response selection, and data selection. Small LLM evaluators combined with PGED outperform strong single evaluators.

Conclusion: PGED effectively enhances evaluation reliability and improves model performance by addressing cyclic preference issues through multi-evaluator graph ensemble and denoising.

Abstract: Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs’ quality regarding preference remains a critical challenge. While existing works usually leverage a strong LLM as the judge for comparing LLMs' responses pairwise, such a single-evaluator approach is vulnerable to cyclic preference, i.e., output A is better than B, B than C, but C is better than A, causing contradictory evaluation results. To address this, we introduce PGED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments on ten benchmarks demonstrate PGED’s superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
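
A simple stand-in for the ensemble-and-denoise step: aggregate each evaluator's pairwise preferences into a weighted digraph, then delete the weakest edge in each remaining cycle until the graph is acyclic and admits a consistent ranking. PGED's actual denoising carries theoretical guarantees this greedy sketch lacks.

```python
import networkx as nx

def ensemble_and_denoise(edge_lists):
    """Ensemble: sum preference edges across evaluators. Denoise: break cycles
    by removing the lowest-weight edge until a topological order exists."""
    G = nx.DiGraph()
    for edges in edge_lists:                      # one edge list per evaluator
        for a, b in edges:                        # "a is preferred over b"
            w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
            G.add_edge(a, b, weight=w)
    while not nx.is_directed_acyclic_graph(G):
        cycle = nx.find_cycle(G)
        u, v = min(cycle, key=lambda e: G[e[0]][e[1]]["weight"])[:2]
        G.remove_edge(u, v)
    return list(nx.topological_sort(G))           # consistent ranking

votes = [[("A", "B"), ("B", "C")], [("A", "B"), ("C", "A")], [("B", "C"), ("A", "C")]]
print(ensemble_and_denoise(votes))  # ['A', 'B', 'C'] despite one cyclic vote
```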

[66] This Candidate is [MASK]. Prompt-based Sentiment Extraction and Reference Letters

Fabian Slonimczyk

Main category: cs.CL

TL;DR: A prompt-based method for extracting sentiment from text using pre-trained LLMs without fine-tuning, applied to reference letters to show sentiment affects job market outcomes and reveals gender bias.

Motivation: To develop a simple, effective method for sentiment analysis that doesn't require data preprocessing or labeled data, overcoming limitations of traditional approaches in economics and finance.

Method: Prompt-based sentiment extraction using pre-trained large language models without fine-tuning or labeled data, applied to confidential reference letters.

Result: Higher average sentiment in reference letters correlates with better job market outcomes; sentiment dispersion negatively affects performance; gender differences in letter content (women: grindstone traits, men: standout traits) negatively impact women’s outcomes.

Conclusion: Prompt-based sentiment extraction outperforms traditional methods and reveals meaningful patterns in job market outcomes and gender bias in reference letters.

Abstract: I propose a relatively simple way to deploy pre-trained large language models (LLMs) in order to extract sentiment and other useful features from text data. The method, which I refer to as prompt-based sentiment extraction, offers multiple advantages over other methods used in economics and finance. In particular, it accepts the text input as is (without pre-processing) and produces a sentiment score that has a probability interpretation. Unlike other LLM-based approaches, it does not require any fine-tuning or labeled data. I apply my prompt-based strategy to a hand-collected corpus of confidential reference letters (RLs). I show that the sentiment contents of RLs are clearly reflected in job market outcomes. Candidates with higher average sentiment in their RLs perform markedly better regardless of the measure of success chosen. Moreover, I show that sentiment dispersion among letter writers negatively affects the job market candidate’s performance. I compare my sentiment extraction approach to other commonly used methods for sentiment analysis: ‘bag-of-words’ approaches, fine-tuned language models, and querying advanced chatbots. No other method can fully reproduce the results obtained by prompt-based sentiment extraction. Finally, I slightly modify the method to obtain ‘gendered’ sentiment scores (as in Eberhardt et al., 2023). I show that RLs written for female candidates emphasize ‘grindstone’ personality traits, whereas male candidates’ letters emphasize ‘standout’ traits. These gender differences negatively affect women’s job market outcomes.
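
The mechanics of prompt-based sentiment extraction are straightforward to sketch with a masked LM: append a prompt ending in the mask token and compare the probabilities of opposing verbalizer words at that position. The prompt wording, the verbalizers, and the use of bert-base-uncased below are illustrative choices, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def sentiment_score(letter, pos_word="good", neg_word="bad"):
    """Probability-interpretable sentiment: P(pos) / (P(pos) + P(neg)) at the mask."""
    text = letter + " Overall, this candidate is " + tok.mask_token + "."
    inputs = tok(text, return_tensors="pt", truncation=True)
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = logits.softmax(-1)
    p_pos = probs[tok.convert_tokens_to_ids(pos_word)]
    p_neg = probs[tok.convert_tokens_to_ids(neg_word)]
    return (p_pos / (p_pos + p_neg)).item()

print(sentiment_score("Her research is outstanding and her teaching is superb."))
```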

[67] Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data

Zhenqing Ling, Daoyuan Chen, Liuyi Yao, Qianli Shen, Yaliang Li, Ying Shen

Main category: cs.CL

TL;DR: Proposes a dual-identity method for LLMs to self-select diverse training data, addressing challenges with missing domain labels and multi-domain performance balancing.

Motivation: Existing methods struggle with data that has missing/imprecise domain labels, while data selection methods have difficulty balancing multi-domain performance.

Method: Gives LLMs dual identity: as output model to probe/select data based on diversity reward, and as input model to be tuned with selected data.

Result: Notably boosts performance across domain-undetermined data and foundational downstream tasks for various advanced LLMs.

Conclusion: The study advances understanding of data diversity and enables feedback-driven data-model co-design for LLMs.

Abstract: Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.
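
Diversity-driven selection can be sketched as greedy facility-location coverage over embedding similarities, one concrete instantiation of a diversity reward; in the paper the reward is produced by the LLM itself rather than by fixed embeddings.

```python
import numpy as np

def select_diverse(embeddings, k):
    """Greedy facility-location selection: repeatedly add the example that most
    increases coverage of the pool under cosine similarity."""
    sim = embeddings @ embeddings.T
    chosen = [int(np.argmax(sim.sum(1)))]          # start from the most central point
    covered = sim[chosen[0]].copy()
    for _ in range(k - 1):
        gains = np.maximum(sim, covered).sum(1) - covered.sum()
        gains[chosen] = -np.inf
        nxt = int(np.argmax(gains))
        chosen.append(nxt)
        covered = np.maximum(covered, sim[nxt])
    return chosen

X = np.random.randn(100, 16)
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-normalize embeddings
print(select_diverse(X, 5))
```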

[68] Unstructured Evidence Attribution for Long Context Query Focused Summarization

Dustin Wright, Zain Muhammad Mujahid, Lu Wang, Isabelle Augenstein, David Jurgens

Main category: cs.CL

TL;DR: The paper proposes SUnsET, a synthetic dataset for training LLMs to extract unstructured evidence spans of any length for summarization, improving relevance and factual consistency compared to fixed-granularity approaches.

Motivation: Previous work on evidence citation used fixed granularity levels (sentence, paragraph, document), which may not capture the most relevant evidence. Unstructured evidence spans of any length could provide more relevant and consistent evidence.

Method: Created SUnsET (Summaries with Unstructured Evidence Text dataset) using a novel synthetic generation pipeline to train LLMs for unstructured evidence extraction and citation in summarization tasks.

Result: Across 5 LLMs and 4 datasets, models adapted with SUnsET generated more relevant and factually consistent evidence, extracted evidence from more diverse context locations, and produced more relevant and consistent summaries than baselines.

Conclusion: Unstructured evidence extraction with SUnsET training improves LLM summarization quality and evidence citation compared to fixed-granularity approaches, addressing the “lost-in-the-middle” problem.

Abstract: Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query, and extracting and citing evidence spans helps improve the trustworthiness of these summaries. Whereas previous work has focused on evidence citation with fixed levels of granularity (e.g. sentence, paragraph, document, etc.), we propose to extract unstructured (i.e., spans of any length) evidence in order to acquire more relevant and consistent evidence than in the fixed granularity case. We show how existing systems struggle to copy and properly cite unstructured evidence, which also tends to be “lost-in-the-middle”. To help models perform this task, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel pipeline, which can be used as training supervision for unstructured evidence summarization. We demonstrate across 5 LLMs and 4 datasets spanning human written, synthetic, single, and multi-document settings that LLMs adapted with SUnsET generate more relevant and factually consistent evidence with their summaries, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries than baselines with no fine-tuning and fixed granularity evidence. We release SUnsET and our generation code to the public.
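
The copy-and-cite failure the abstract mentions can be measured with a simple verbatim-fidelity check over cited spans (a hypothetical metric, for intuition):

```python
def evidence_fidelity(context, cited_spans):
    """Fraction of cited evidence spans that occur verbatim in the source context."""
    return sum(span in context for span in cited_spans) / len(cited_spans)

ctx = "The study began in 2019. Results improved after the policy change."
print(evidence_fidelity(ctx, ["began in 2019", "after the policy shift"]))  # 0.5
```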

[69] More of the Same: Persistent Representational Harms Under Increased Representation

Jennifer Mickel, Maria De-Arteaga, Leqi Liu, Kevin Tian

Main category: cs.CL

TL;DR: The paper examines gender representation in large language models, finding that while interventions have increased female representation in outputs, harmful stereotypes and biases persist in how different genders are portrayed.

Motivation: To address the gap in bias mitigation where improving who is represented doesn't address how people are represented, particularly focusing on gender representation in generative AI systems.

Method: Investigated gender representation in occupation across state-of-the-art large language models by analyzing gender distributions in generated biographies/personas and examining statistically significant word differences across genders.

Result: Found that women are more represented than men in generated content, but representational biases persist through stereotypical word associations and reinforcement of neoliberal ideals despite increased female representation.

Conclusion: Current interventions to increase female representation are insufficient as they fail to address how people are represented, leading to continued reinforcement of harmful stereotypes and systems of oppression.

Abstract: To recognize and mitigate the harms of generative AI systems, it is crucial to consider who is represented in the outputs of generative AI systems and how people are represented. A critical gap emerges when naively improving who is represented, as this does not imply bias mitigation efforts have been applied to address how people are represented. We critically examined this by investigating gender representation in occupation across state-of-the-art large language models. We first show evidence suggesting that over time there have been interventions to models altering the resulting gender distribution, and we find that women are more represented than men when models are prompted to generate biographies or personas. We then demonstrate that representational biases persist in how different genders are represented by examining statistically significant word differences across genders. This results in a proliferation of representational harms, stereotypes, and neoliberal ideals that, despite existing interventions to increase female representation, reinforce existing systems of oppression.
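
For the word-difference analysis, a standard tool is the informative-Dirichlet log-odds ratio of Monroe et al. (2008); the simplified variant below (no background corpus) is an assumption about, not a reproduction of, the paper's test.

```python
import numpy as np
from collections import Counter

def log_odds_ratio(corpus_a, corpus_b, alpha=0.01):
    """Z-scored log-odds of each word between two corpora; large positive values
    are distinctive of corpus_a, large negative of corpus_b."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    vocab = set(ca) | set(cb)
    na, nb = sum(ca.values()), sum(cb.values())
    scores = {}
    for w in vocab:
        la = np.log((ca[w] + alpha) / (na + alpha * len(vocab) - ca[w] - alpha))
        lb = np.log((cb[w] + alpha) / (nb + alpha * len(vocab) - cb[w] - alpha))
        var = 1 / (ca[w] + alpha) + 1 / (cb[w] + alpha)
        scores[w] = (la - lb) / np.sqrt(var)
    return scores

a = "diligent careful hardworking diligent".split()   # toy 'grindstone' corpus
b = "brilliant standout exceptional brilliant".split()  # toy 'standout' corpus
print(sorted(log_odds_ratio(a, b).items(), key=lambda kv: kv[1]))
```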

[70] Improving LLM Safety Alignment with Dual-Objective Optimization

Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song

Main category: cs.CL

TL;DR: The paper proposes DOOR-Alignment, an improved safety alignment method that addresses limitations in Direct Preference Optimization (DPO) for LLM safety. It disentangles DPO objectives into robust refusal training and targeted unlearning of harmful knowledge, significantly enhancing jailbreak resistance.

Motivation: Existing training-time safety alignment techniques for LLMs remain vulnerable to jailbreak attacks, and DPO shows limitations in both experimental and theoretical contexts as its loss function is suboptimal for refusal learning.

Method: The approach disentangles DPO objectives into two components: (1) robust refusal training that encourages refusal even with partial unsafe generations, and (2) targeted unlearning of harmful knowledge. It also introduces a reward-based token-level weighting mechanism to emphasize critical refusal tokens.
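
A minimal PyTorch sketch of how the two disentangled objectives could combine into one loss; the loss form, the weighting inputs, and the mixing coefficient `beta` are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative dual-objective loss: (1) weighted refusal training,
# (2) targeted unlearning of harmful continuations. Shapes and weights
# are placeholders, not the released DOOR-Alignment implementation.
import torch
import torch.nn.functional as F

def door_style_loss(logits, refusal_ids, harmful_ids, refusal_weights, beta=0.5):
    """logits: (seq, vocab); refusal_ids/harmful_ids: (seq,) token targets;
    refusal_weights: (seq,) reward-based per-token weights."""
    # (1) robust refusal: weighted NLL emphasizing critical refusal tokens
    refusal_nll = F.cross_entropy(logits, refusal_ids, reduction="none")
    refusal_loss = (refusal_weights * refusal_nll).mean()
    # (2) targeted unlearning: push probability mass away from harmful tokens
    harmful_logprob = F.log_softmax(logits, dim=-1).gather(
        1, harmful_ids.unsqueeze(1)).squeeze(1)
    unlearn_loss = harmful_logprob.mean()   # minimizing this lowers p(harmful)
    return refusal_loss + beta * unlearn_loss
```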

Result: The method significantly increases LLM robustness against a wide range of jailbreak attacks including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios.

Conclusion: Robustness to jailbreak attacks correlates with token distribution shifts in training and internal representations of refusal/harmful tokens, offering valuable directions for future LLM safety alignment research.

Abstract: Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment

[71] Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models

José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins

Main category: cs.CL

TL;DR: Zero-shot Benchmarking (ZSB) is a framework that uses language models to automatically create high-quality benchmarks for any task through synthetic test data generation and evaluation, outperforming standard benchmarks in correlation with human rankings.

Motivation: As language models become more capable across modalities, automatic evaluation becomes increasingly challenging: task-specific metrics are hard to develop, human-annotated test sets are expensive and saturate quickly, and existing automated approaches are limited by pre-existing data or focus on single tasks.

Method: ZSB requires only two prompts: one for data generation and one for evaluation. It uses language models to create synthetic test data and perform evaluation automatically, making it scalable to tasks and languages where real data collection is costly, and model-agnostic, allowing increasingly challenging benchmarks to be created as models improve.
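
Because the whole pipeline reduces to two prompts, it can be sketched in a few lines; `call_llm`, the prompt wording, and the 1-6 scoring scale below are invented placeholders, not the released framework.

```python
# Conceptual sketch of the two-prompt ZSB loop: one prompt generates synthetic
# test items, another scores each system's outputs.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

GEN_PROMPT = "Write one challenging {task} test input. Return only the input."
JUDGE_PROMPT = ("Task: {task}\nInput: {item}\nSystem answer: {answer}\n"
                "Score the answer from 1 (poor) to 6 (excellent). "
                "Return only the number.")

def zsb_benchmark(task: str, systems: dict, n_items: int = 100) -> dict:
    """systems: {name: fn(input_text) -> answer}. Returns mean score per system."""
    items = [call_llm(GEN_PROMPT.format(task=task)) for _ in range(n_items)]
    scores = {name: 0.0 for name in systems}
    for item in items:
        for name, system in systems.items():
            answer = system(item)
            scores[name] += float(call_llm(
                JUDGE_PROMPT.format(task=task, item=item, answer=answer)))
    return {name: total / n_items for name, total in scores.items()}
```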

Result: The framework was tested on five text-only tasks and one multi-modal task across multiple languages. ZSB rankings consistently showed strong correlation with human rankings, outperforming widely-adopted standard benchmarks. Ablations revealed that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial for performance.

Conclusion: ZSB provides a simple, flexible, and scalable solution for automated benchmark creation that reliably correlates with human evaluation, enabling the development of increasingly challenging benchmarks as models improve without the need for expensive human annotation.

Abstract: As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets – which are expensive to create – saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.

[72] M-Prometheus: A Suite of Open Multilingual LLM Judges

José Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, André F. T. Martins

Main category: cs.CL

TL;DR: M-Prometheus is a suite of multilingual LLM judges (3B-14B parameters) that outperforms existing open LLM judges on multilingual evaluation tasks across 20+ languages and improves generated outputs in 3 tested languages.

Motivation: Most LLM judges are optimized only for English, creating a disparity in automatic evaluation quality for non-English languages and hindering development of better multilingual models.

Method: Developed M-Prometheus models trained on synthetic multilingual feedback data instead of translated data, with careful backbone model selection. Models provide both direct assessment and pairwise comparison feedback on multilingual outputs.

Result: Outperforms state-of-the-art open LLM judges on multilingual reward benchmarks (20+ languages) and literary machine translation evaluation (4 language pairs). Can significantly improve generated outputs across all 3 tested languages when used at decoding time.

Conclusion: Identified key factors for effective multilingual judges (backbone selection, synthetic multilingual training data) and demonstrated M-Prometheus’s utility for developing better multilingual models. Models, dataset, and code are released.

Abstract: The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on synthetic multilingual feedback data instead of translated data. We release our models, training dataset, and code.

[73] SEA-LION: Southeast Asian Languages in One Network

Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, Brandon Ong, Zhi Hao Ong, Jann Railey Montalan, Adwin Chan, Sajeban Antonyrex, Ren Lee, Esther Choa, David Ong Tat-Wee, Bing Jie Darius Liu, William Chandra Tjhi, Erik Cambria, Leslie Teo

Main category: cs.CL

TL;DR: The paper introduces SEA-LION family LLMs (Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT) that support 11 Southeast Asian languages, achieving state-of-the-art performance through multilingual pre-training and post-training techniques.

Motivation: To address the under-representation of Southeast Asian languages in LLM research, which is predominantly English-centric, by creating specialized models for 11 SEA languages.

Method: Large-scale multilingual continued pre-training followed by comprehensive post-training including multiple stages of instruction fine-tuning, alignment, and model merging.

Result: The models achieve state-of-the-art performance across multilingual benchmarks for SEA languages.

Conclusion: The SEA-LION models successfully fill the representation gap for Southeast Asian languages and are open-sourced to benefit the wider SEA community.

Abstract: Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

[74] ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett

Main category: cs.CL

TL;DR: Current LVLMs struggle with chart understanding due to an imbalance between visual and textual reasoning. A new benchmark called ChartMuseum reveals significant performance gaps between models and humans, especially on visual reasoning tasks.

Motivation: Chart understanding requires sophisticated integration of visual and textual reasoning, but current LVLMs show notable imbalance between these skills, performing poorly on visual reasoning that cannot be easily handled through text alone.

Method: The study uses a synthetic dataset to demonstrate performance degradation with visual complexity, then introduces ChartMuseum - a new Chart QA benchmark with 1,162 expert-annotated questions from real-world charts across 184 sources, specifically designed to evaluate complex visual and textual reasoning.

Result: ChartMuseum exposes substantial performance gaps: humans achieve 93% accuracy, while best-performing model Gemini-2.5-Pro attains only 63.0%, and leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Models show 35%-55% performance drop on visual reasoning questions compared to text-heavy questions.

Conclusion: Current LVLMs have significant limitations in visual reasoning for chart understanding, with specific categories of visual reasoning being particularly challenging. The ChartMuseum benchmark effectively differentiates model capabilities and reveals critical gaps that need to be addressed.

Abstract: Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks – where frontier models perform similarly and near saturation – our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.

[75] Let LRMs Break Free from Overthinking via Self-Braking Tuning

Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang

Main category: cs.CL

TL;DR: Self-Braking Tuning (SBT) is a framework that enables large reasoning models to self-regulate their reasoning process, reducing redundant computations by up to 60% while maintaining accuracy.

Motivation: Large reasoning models generate long chains of thought that increase performance but cause computational overhead and overthinking. Existing solutions rely on external interventions, which SBT aims to eliminate.

Method: SBT uses overthinking identification metrics based on standard answers to detect redundant reasoning, constructs data with adaptive reasoning lengths, and employs a braking prompt mechanism for natural termination learning.
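
One plausible overthinking-identification metric consistent with this description: locate the earliest reasoning step that already contains the standard answer and treat everything after it as redundant. The paper's actual metrics may differ; this is an illustrative sketch.

```python
# Toy overthinking metric: fraction of reasoning steps generated after the
# standard answer is first reached. `steps` and `answer` are assumptions.
def overthinking_ratio(steps: list[str], answer: str) -> float:
    for i, step in enumerate(steps):
        if answer in step:                   # answer already reached at step i
            return (len(steps) - i - 1) / len(steps)
    return 0.0                               # answer never reached: nothing to trim

# Example: 6 steps, answer first appears in step 3 of 6 -> half are redundant.
assert overthinking_ratio(["a", "b", "x=42", "c", "d", "e"], "42") == 0.5
```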

Result: Experiments on mathematical benchmarks (AIME, AMC, MATH500, GSM8K) show up to a 60% reduction in token consumption while maintaining accuracy comparable to unconstrained models.

Conclusion: SBT effectively addresses overthinking by enabling self-regulation, reducing computational costs without sacrificing performance, and eliminating dependency on external control mechanisms.

Abstract: Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.

[76] Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English

Ishmanbir Singh, Dipankar Srirag, Aditya Joshi

Main category: cs.CL

TL;DR: The paper presents a method using Pragmatic Metacognitive Prompting (PMP) for explainable sarcasm detection in Australian and Indian English, achieving significant performance improvements over alternative prompting strategies.

Motivation: Sarcasm poses challenges for sentiment analysis due to incongruity between stated and implied sentiment, especially when implications are region-specific. The authors aim to address this challenge for Australian and Indian English varieties.

Method: The approach uses Pragmatic Metacognitive Prompting (PMP) with open-weight LLMs (GEMMA and LLAMA). The authors manually added sarcasm explanations to the BESSTIE dataset and compared performance with the FLUTE dataset. They also explored agentic prompting for context-related failures.
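
A rough sketch of what a pragmatic metacognitive prompt could look like; the stage structure below mirrors the general PMP idea (interpret, judge, reflect, answer) but does not reproduce the paper's exact wording.

```python
# Hypothetical PMP-style prompt template for explainable sarcasm detection.
PMP_TEMPLATE = """Text (from {variety} English): {text}

Step 1 - Understand: What is literally said, and what might be implied
in this regional context?
Step 2 - Judge: Is there incongruity between the stated and implied sentiment?
Step 3 - Reflect: Re-examine your judgment. Could the implication be
region-specific or non-sarcastic?
Step 4 - Answer: Output 'sarcastic' or 'not sarcastic', then a one-sentence
explanation of the implied meaning."""

prompt = PMP_TEMPLATE.format(variety="Australian", text="Nek minit...")
```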

Result: PMP achieved statistically significant performance improvement across all tasks and datasets compared to four alternative prompting strategies. Agentic prompting helped mitigate context-related failures through external knowledge retrieval.

Conclusion: The main contribution is successfully utilizing PMP for generating sarcasm explanations across different English varieties, demonstrating its effectiveness for explainable sarcasm detection in region-specific contexts.

Abstract: Sarcasm is a challenge to sentiment analysis because of the incongruity between stated and implied sentiment. The challenge is exacerbated when the implication may be relevant to a specific country or geographical region. Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that has been used for pragmatic reasoning. In this paper, we harness PMP for explainable sarcasm detection for Australian and Indian English, alongside a benchmark dataset for standard English. We manually add sarcasm explanations to an existing sarcasm-labeled dataset for Australian and Indian English called BESSTIE, and compare the performance for explainable sarcasm detection for them with FLUTE, a standard English dataset containing sarcasm explanations. Our approach utilising PMP when evaluated on two open-weight LLMs (GEMMA and LLAMA) achieves statistically significant performance improvement across all tasks and datasets when compared with four alternative prompting strategies. We also find that alternative techniques such as agentic prompting mitigate context-related failures by enabling external knowledge retrieval. The focused contribution of our work is utilising PMP in generating sarcasm explanations for varieties of English.

[77] ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation

Hao Chen, Yukun Yan, Sen Mei, Wanxiang Che, Zhenghao Liu, Qi Shi, Xinze Li, Yuchun Fan, Pengcheng Huang, Qiushi Xiong, Zhiyuan Liu, Maosong Sun

Main category: cs.CL

TL;DR: ClueAnchor is a novel RAG framework that enhances reasoning by extracting key clues from retrieved documents and generating multiple reasoning paths, then selecting the best one through reward-based optimization.

Motivation: Existing RAG systems often fail to effectively utilize retrieved documents, especially when evidence is implicit, scattered, or noisy, leading to poor reasoning quality.

Method: Extracts key clues from retrieved content, generates multiple reasoning paths based on different knowledge configurations, and uses reward-based preference optimization to select the best reasoning path.
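
The selection step can be pictured as follows; the function names and the three knowledge configurations are placeholders inferred from the description, not the released API.

```python
# Schematic ClueAnchor-style selection: reasoning paths from different
# knowledge configurations are scored by a reward model, and the best/worst
# pair feeds preference optimization.
def select_reasoning_path(query, documents, generate, reward):
    """generate: fn(prompt) -> text; reward: fn(query, answer) -> float."""
    clues = generate(f"Extract the key clues for: {query}\n{documents}")
    configs = {
        "parametric_only": query,                     # no retrieved knowledge
        "full_context": f"{documents}\n{query}",      # raw retrieval
        "clue_anchored": f"Clues: {clues}\n{query}",  # clue-anchored reasoning
    }
    paths = {name: generate(prompt) for name, prompt in configs.items()}
    ranked = sorted(paths.values(), key=lambda p: reward(query, p), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]          # preference pair for tuning
    return chosen, rejected
```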

Result: Significantly outperforms prior RAG baselines in reasoning completeness and robustness, with strong resilience to noisy or partially relevant content.

Conclusion: ClueAnchor effectively improves RAG performance by enabling better clue extraction and reasoning path optimization, even without explicit clue supervision during inference.

Abstract: Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge to improve factuality. However, existing RAG systems frequently underutilize the retrieved documents, failing to extract and integrate the key clues needed to support faithful and interpretable reasoning, especially in cases where relevant evidence is implicit, scattered, or obscured by noise. To address this issue, we propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization. ClueAnchor extracts key clues from retrieved content and generates multiple reasoning paths based on different knowledge configurations, optimizing the model by selecting the most appropriate reasoning path for the given context through reward-based preference optimization. Experiments show that ClueAnchor significantly outperforms prior RAG baselines in the completeness and robustness of reasoning. Further analysis confirms its strong resilience to noisy or partially relevant retrieved content, as well as its capability to identify supporting evidence even in the absence of explicit clue supervision during inference. All codes are available at https://github.com/thunlp/ClueAnchor.

[78] AI Debate Aids Assessment of Controversial Claims

Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, Yejin Choi, Julian Michael, Liwei Jiang, Saadia Gabriel

Main category: cs.CL

TL;DR: AI debate between opposing AI systems improves human judgment accuracy on controversial topics like COVID-19 and climate change, especially for judges with mainstream beliefs, and AI judges with human-like personas achieve even higher accuracy than humans.

Motivation: To address the risk of AI amplifying misinformation and social divides, especially on consequential topics where factual accuracy impacts well-being, and to explore whether AI debate can guide biased human judges toward truth.

Method: Two studies: Study I with human judges (mainstream vs. skeptical beliefs) evaluating claims through debate (two AI advisors arguing opposing sides) vs. consultancy (single AI advisor); Study II with AI judges with and without human-like personas evaluating the same protocols.
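
A minimal sketch contrasting the two protocols; `advisor`, `advocate`, `critic`, and `judge` stand for LLM (or human) calls, and the round counts and prompt phrasing are invented for illustration.

```python
# Consultancy: the judge sees a single advisor's argument.
def consultancy(claim, advisor, judge, rounds=2):
    transcript = []
    for _ in range(rounds):
        transcript.append(advisor(f"Argue for your assessment of: {claim}\n"
                                  + "\n".join(transcript)))
    return judge(claim, "\n".join(transcript))

# Debate: the judge sees two advisors arguing opposing sides.
def debate(claim, advocate, critic, judge, rounds=2):
    transcript = []
    for _ in range(rounds):
        transcript.append("PRO: " + advocate(claim, "\n".join(transcript)))
        transcript.append("CON: " + critic(claim, "\n".join(transcript)))
    return judge(claim, "\n".join(transcript))
```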

Result: Debate consistently improved human judgment accuracy and confidence calibration, outperforming consultancy by 4-10% across topics. Mainstream belief judges showed up to +15.2% accuracy improvement on COVID-19 claims. AI judges with human-like personas achieved 78.5% accuracy vs. 70.1% for humans and 69.8% for default AI judges.

Conclusion: AI debate is a promising path toward scalable, bias-resilient oversight in contested domains, with AI judges with human-like personas showing potential for supervising frontier AI models.

Abstract: As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides, especially on consequential topics where factual accuracy directly impacts well-being. Scalable Oversight aims to ensure AI systems remain truthful even when their capabilities exceed those of their evaluators. Yet when humans serve as evaluators, their own beliefs and biases can impair judgment. We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial factuality claims on COVID-19 and climate change where people hold strong prior beliefs. We conduct two studies. Study I recruits human judges with either mainstream or skeptical beliefs who evaluate claims through two protocols: debate (interaction with two AI advisors arguing opposing sides) or consultancy (interaction with a single AI advisor). Study II uses AI judges with and without human-like personas to evaluate the same protocols. In Study I, debate consistently improves human judgment accuracy and confidence calibration, outperforming consultancy by 4-10% across COVID-19 and climate change claims. The improvement is most significant for judges with mainstream beliefs (up to +15.2% accuracy on COVID-19 claims), though debate also helps skeptical judges who initially misjudge claims move toward accurate views (+4.7% accuracy). In Study II, AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%), suggesting their potential for supervising frontier AI models. These findings highlight AI debate as a promising path toward scalable, bias-resilient oversight in contested domains.

[79] SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat

Yuru Jiang, Wenxuan Ding, Shangbin Feng, Greg Durrett, Yulia Tsvetkov

Main category: cs.CL

TL;DR: SPARTA ALIGNMENT is a collective alignment algorithm where multiple LLMs compete and evaluate each other through duels, using an adapted ELO-ranking system to aggregate scores and create preference pairs for learning.

Motivation: To address the lack of diversity in generation and biases in evaluation of single models by leveraging multiple LLMs that can compete and serve as judges for each other.

Method: Multiple LLMs form a ‘sparta tribe’ where models compete in duels to fulfill instructions, while other models evaluate responses. An adapted ELO-ranking system aggregates scores, creating preference pairs where winning responses are preferred over losing ones for model learning.
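
A small sketch of what an adapted Elo-style reputation update could look like, with judges' votes weighted by their current rating; the constants and the exact weighting rule are assumptions, not the paper's precise system.

```python
# Toy Elo-based reputation update for a duel between models a and b,
# where the other models vote and their votes count by current rating.
def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def duel_update(ratings, a, b, votes, k=32):
    """votes: {judge_name: 'a' or 'b'}; judges' votes weighted by rating."""
    score_a = sum(ratings[j] for j, v in votes.items() if v == "a")
    score_b = sum(ratings[j] for j, v in votes.items() if v == "b")
    outcome = 1.0 if score_a > score_b else 0.0      # 1 means model a wins
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * (outcome - e_a)                # winner gains judging weight
    ratings[b] += k * ((1.0 - outcome) - (1.0 - e_a))

ratings = {"m1": 1000.0, "m2": 1000.0, "m3": 1000.0}
duel_update(ratings, "m1", "m2", votes={"m3": "a"})  # m1 wins, rating rises
```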

Result: Outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks/datasets with 7.0% average improvement. Shows better generalization to unseen tasks and produces more logical, direct, and informative outputs.

Conclusion: SPARTA ALIGNMENT enables effective self-evolution of multiple LLMs through iterative collective competition, leveraging expertise diversity to improve performance and generalization.

Abstract: We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model’s lack of diversity in generation and biases in evaluation, multiple LLMs form a “sparta tribe” to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.

[80] Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi

Main category: cs.CL

TL;DR: Adversarial Paraphrasing is a training-free attack framework that uses an instruction-following LLM to paraphrase AI-generated text under detector guidance, effectively evading multiple detection systems while maintaining text quality.

Motivation: Address the vulnerability of current AI-generated text detectors to simple evasion techniques like paraphrasing, and develop a more sophisticated attack that can bypass even recent robust detectors.

Method: Leverage an off-the-shelf instruction-following LLM to paraphrase AI-generated content while being guided by an AI text detector, creating adversarial examples optimized to evade detection without requiring training.
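
The detector-guided loop can be sketched as a simple best-of-n search; `paraphrase` and `detector_score` are placeholder callables, and the candidate and round counts below are arbitrary.

```python
# Schematic detector-guided paraphrasing: sample candidate paraphrases from an
# instruction-following LLM and keep the one the detector scores as most human.
def adversarial_paraphrase(text, paraphrase, detector_score,
                           n_candidates=8, rounds=3):
    """detector_score(t) -> probability t is AI-generated (lower = better)."""
    best = text
    for _ in range(rounds):
        candidates = [paraphrase(best) for _ in range(n_candidates)]
        candidates.append(best)              # keep the current best in the pool
        best = min(candidates, key=detector_score)
    return best
```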

Result: Achieves significant detection rate reductions: 64.49% on RADAR and 98.96% on Fast-DetectGPT under OpenAI-RoBERTa-Large guidance, with an average 87.88% T@1%F reduction across diverse detectors and mostly only slight degradation in text quality.

Conclusion: The attack demonstrates the vulnerability of current detection systems to sophisticated evasion techniques, highlighting the need for more robust and resilient detection strategies in the face of increasingly advanced attacks.

Abstract: The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack–which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT–adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors–including neural network-based, watermark-based, and zero-shot approaches–our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.

[81] Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens

Ziyang Ma, Qingyue Yuan, Zhenglin Wang, Deyu Zhou

Main category: cs.CL

TL;DR: This paper proposes AutoMeco, a framework for benchmarking lenses on LLM meta-cognition, and MIRA, a training-free strategy that improves those lenses, together enabling better evaluation of LLMs’ self-awareness of reasoning errors.

Motivation: Previous research focused on LLMs' cognitive error detection but neglected meta-cognitive abilities like self-awareness of step errors, which are crucial for reliability. Existing self-evaluation measures lack step-level analysis and adaptation.

Method: Proposed AutoMeco, an automated framework for benchmarking existing meta-cognition evaluation lenses, and MIRA (Markovian Intrinsic Reward Adjustment), a training-free strategy to boost current meta-cognition lenses.

Result: Experiments on three mathematical reasoning datasets and three LLMs support the reasonableness of AutoMeco via comparison with Best-of-N verification, and show that MIRA enables better evaluation of LLMs’ meta-cognition abilities.

Conclusion: The study demonstrates that LLM meta-cognition can be effectively evaluated using the proposed AutoMeco framework and enhanced through the MIRA strategy, addressing the gap in step-level meta-cognitive assessment.

Abstract: Previous research has primarily focused on the cognitive error detection capabilities of Large Language Models (LLMs), often prompting them to analyze mistakes in reasoning chains. However, few studies have examined the meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors), which are crucial for their reliability. While studies on LLM self-evaluation present some measures, such as perplexity, which can reflect the answer correctness and be viewed as the lens of meta-cognition, they lack step-level analysis and adaptation. This paper studies the evaluation of LLM meta-cognition using the current lenses and how to improve these lenses. Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation framework for benchmarking the existing lenses. Furthermore, a training-free Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost current meta-cognition lenses. Experimental results on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco by comparing it with Best-of-N verification. Moreover, the meta-cognition ability of LLMs can be better evaluated using MIRA.

[82] Comparing human and LLM politeness strategies in free production

Haoran Zhao, Robert D. Hawkins

Main category: cs.CL

TL;DR: LLMs can replicate human politeness strategies but over-rely on negative politeness even in positive contexts, raising concerns about pragmatic alignment.

Motivation: To investigate whether LLMs employ context-sensitive politeness strategies like humans, balancing informational and social goals through various linguistic approaches.

Method: Compared human and LLM responses in constrained and open-ended production tasks, analyzing politeness strategies across different model sizes.

Result: Larger models (≥70B parameters) successfully replicate human politeness preferences and are preferred by human evaluators, but disproportionately use negative politeness strategies even in positive contexts.

Conclusion: While modern LLMs demonstrate impressive politeness capabilities, their over-reliance on negative strategies in inappropriate contexts reveals subtle pragmatic misalignments that need addressing.

Abstract: Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals – from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses in both constrained and open-ended production tasks. We find that larger models (≥70B parameters) successfully replicate key preferences from the computational pragmatics literature, and human evaluators surprisingly prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies even in positive contexts, potentially leading to misinterpretations. While modern LLMs demonstrate an impressive handle on politeness strategies, these subtle differences raise important questions about pragmatic alignment in AI systems.

[83] The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs

Songyang Liu, Chaozhuo Li, Jiameng Qiu, Xi Zhang, Feiran Huang, Litian Zhang, Yiming Hei, Philip S. Yu

Main category: cs.CL

TL;DR: This paper provides a comprehensive survey on safety evaluation of Large Language Models (LLMs), proposing a four-dimensional taxonomy covering why, what, where, and how to evaluate LLM safety risks like toxicity, bias, and misinformation.

Motivation: The widespread deployment of LLMs has raised significant safety concerns about unsafe behaviors such as toxicity, bias, and misinformation, especially in adversarial contexts. Despite numerous studies, a comprehensive and systematic survey on LLM safety evaluation is still lacking.

Method: The authors propose a four-dimensional taxonomy: (1) Why to evaluate - background and significance of safety evaluation; (2) What to evaluate - categorization of safety evaluation tasks across dimensions like toxicity, robustness, ethics, bias, fairness, and truthfulness; (3) Where to evaluate - metrics, datasets and benchmarks; (4) How to evaluate - evaluation methods based on evaluator roles and integrated evaluation frameworks.

Result: The paper presents a structured overview of recent advances in LLM safety evaluation and identifies current challenges in the field.

Conclusion: The authors emphasize the necessity of prioritizing safety evaluation to ensure reliable and responsible deployment of LLMs in real-world applications, and propose promising research directions to advance this field.

Abstract: With the rapid advancement of artificial intelligence, Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), including content generation, human-computer interaction, machine translation, and code generation. However, their widespread deployment has also raised significant safety concerns. In particular, LLM-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts, which has attracted increasing attention from both academia and industry. Although numerous studies have attempted to evaluate these risks, a comprehensive and systematic survey on safety evaluation of LLMs is still lacking. This work aims to fill this gap by presenting a structured overview of recent advances in safety evaluation of LLMs. Specifically, we propose a four-dimensional taxonomy: (i) Why to evaluate, which explores the background of safety evaluation of LLMs, how they differ from general LLMs evaluation, and the significance of such evaluation; (ii) What to evaluate, which examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and related aspects; (iii) Where to evaluate, which summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (iv) How to evaluate, which reviews existing mainstream evaluation methods based on the roles of the evaluators and some evaluation frameworks that integrate the entire evaluation pipeline. Finally, we identify the challenges in safety evaluation of LLMs and propose promising research directions to promote further advancement in this field. We emphasize the necessity of prioritizing safety evaluation to ensure the reliable and responsible deployment of LLMs in real-world applications.

[84] IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation

Zijie Lin, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Fuli Feng, Tat-Seng Chua

Main category: cs.CL

TL;DR: IGD is a token handling strategy that uses Information Gain to identify decisive tokens in LLM-based recommendation systems, improving performance by prioritizing high-decisiveness tokens during tuning and decoding.

Motivation: Existing LLM recommendation methods treat all item tokens equally, overlooking token-level differences in decisiveness, which can impair model performance when low-decisiveness tokens dominate optimization and decoding.

Method: Proposes Information Gain-based Decisiveness-aware Token handling (IGD) that quantifies token decisiveness using Information Gain, downweights low-IG tokens during tuning, and rebalances decoding to emphasize high-IG tokens.
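
A toy illustration of the Information Gain idea over a small item catalog: a token's decisiveness is how much it shrinks the entropy over the remaining candidate items. The paper's estimator operates over an LLM's token distribution, so this prefix-matching version is only a conceptual sketch.

```python
# IG of a token = entropy over candidate items before it minus entropy after,
# assuming a uniform distribution over the matching items.
import math

def entropy(n: int) -> float:
    return math.log2(n) if n > 0 else 0.0

def token_information_gain(items: list[str], prefix: str, token: str) -> float:
    """IG (in bits) of appending `token` to `prefix`, over a catalog of items."""
    before = [it for it in items if it.startswith(prefix)]
    after = [it for it in before if it.startswith(prefix + token)]
    return entropy(len(before)) - entropy(len(after))

catalog = ["red shirt", "red shoes", "blue shirt", "blue shoes"]
# A first token that halves the candidate set is decisive (IG = 1 bit)...
assert token_information_gain(catalog, "", "red") == 1.0
# ...while a token shared by all remaining items is indecisive (IG = 0).
assert token_information_gain(catalog, "red s", "h") == 0.0
```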

Result: Extensive experiments on four benchmark datasets with two LLM backbones show IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.

Conclusion: IGD effectively moves beyond pure likelihood maximization by prioritizing high-decisiveness tokens, demonstrating that token-level decisiveness awareness is crucial for improving LLM-based recommendation performance.

Abstract: Large Language Models (LLMs) have shown strong potential for recommendation by framing item prediction as a token-by-token language generation task. However, existing methods treat all item tokens equally, simply pursuing likelihood maximization during both optimization and decoding. This overlooks crucial token-level differences in decisiveness-many tokens contribute little to item discrimination yet can dominate optimization or decoding. To quantify token decisiveness, we propose a novel perspective that models item generation as a decision process, measuring token decisiveness by the Information Gain (IG) each token provides in reducing uncertainty about the generated item. Our empirical analysis reveals that most tokens have low IG but often correspond to high logits, disproportionately influencing training loss and decoding, which may impair model performance. Building on these insights, we introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. Specifically, IGD downweights low-IG tokens during tuning and rebalances decoding to emphasize tokens with high IG. In this way, IGD moves beyond pure likelihood maximization, effectively prioritizing high-decisiveness tokens. Extensive experiments on four benchmark datasets with two LLM backbones demonstrate that IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.

[85] Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study

Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Seraphina Zhang, Tianfu Wang, Nicholas Jing Yuan, Xing Xie, Hui Xiong

Main category: cs.CL

TL;DR: The paper introduces a framework to evaluate LLMs’ learning ability across three dimensions: Learning from Instructor, Learning from Concept, and Learning from Experience, with empirical findings and a new benchmark.

Motivation: LLMs have shown impressive capabilities, but their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored.

Method: A framework inspired by cognitive psychology and education that decomposes general learning ability into three dimensions, followed by a comprehensive empirical study across these dimensions.

Result: Key findings include: interaction improves learning, conceptual understanding is scale-emergent and benefits larger models, and LLMs are effective few-shot learners but not many-shot learners.

Conclusion: The introduced benchmark provides unified evaluation of LLMs’ general learning abilities across three learning cognition dimensions, enabling diagnostic insights and supporting development of more adaptive models.

Abstract: Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs’ general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.

[86] Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality

Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, Yu Takagi

Main category: cs.CL

TL;DR: This paper presents a large-scale study of supervised fine-tuning (SFT) for LLMs, training 1,000+ models across various tasks to understand dataset properties and layer-wise modifications that drive performance.

Motivation: To systematically understand SFT's mechanisms and identify key factors that influence alignment success, as many aspects of SFT remain poorly understood despite its critical role in aligning LLMs with human instructions.

Method: Trained 1,000+ SFT models from various base models on diverse datasets (code generation, mathematical reasoning, general-domain tasks) under controlled conditions, then analyzed dataset properties and layer-wise modifications.
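
The layer-wise analysis can be approximated by comparing checkpoints directly; the sketch below assumes Hugging Face-style parameter names (`model.layers.N....`) and is not the paper's released analysis code.

```python
# Measure how far each transformer layer's weights moved during SFT,
# as mean relative L2 change between base and fine-tuned checkpoints.
import re
from collections import defaultdict
import torch

def layerwise_weight_shift(base_state_dict, sft_state_dict):
    """Return {layer_index: mean relative L2 change of that layer's weights}."""
    shifts, counts = defaultdict(float), defaultdict(int)
    for name, w_base in base_state_dict.items():
        m = re.search(r"layers\.(\d+)\.", name)   # e.g. model.layers.12.mlp...
        if m is None or name not in sft_state_dict:
            continue
        layer = int(m.group(1))
        delta = (sft_state_dict[name].float() - w_base.float()).norm()
        shifts[layer] += (delta / w_base.float().norm()).item()
        counts[layer] += 1
    return {l: shifts[l] / counts[l] for l in sorted(shifts)}
```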

Result: Found that training-task synergies vary across models, perplexity consistently predicts SFT effectiveness better than superficial data similarity, and mid-layer weight changes correlate most strongly with performance gains.

Conclusion: SFT effectiveness depends on model-specific strategies, and the study provides valuable insights and resources (1,000+ models and benchmarks) to accelerate future SFT research.

Abstract: Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness, often surpassing superficial similarity between the training data and the benchmark, and that mid-layer weight changes correlate most strongly with performance gains. We release these 1,000+ SFT models and benchmark results to accelerate further research. All resources are available at https://github.com/llm-jp/massive-sft.

[87] Controlling Thinking Speed in Reasoning Models

Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye

Main category: cs.CL

TL;DR: The paper enables Large Reasoning Models to dynamically adjust thinking speed between fast intuitive processing and slow deliberate reasoning, achieving better accuracy-efficiency trade-offs without training.

Motivation: Current LRMs excel at slow System 2 thinking but lack fast System 1 thinking, leading to high computational overhead and latency. The goal is to approximate human intelligence through dynamic thinking speed adjustment.

Method: Identified a steering vector for slow-fast thinking transitions in representation space, applied real-time difficulty estimation to signal reasoning complexity, and combined the two into an adaptive reasoning strategy.
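
Representation editing of this kind is typically implemented as a forward hook that shifts a layer's hidden states along the steering direction; the layer choice, the vector's construction, and the sign convention below are assumptions for illustration.

```python
# Conceptual sketch: add alpha * steering_vector to a chosen layer's hidden
# states at inference time. The vector might be, e.g., the difference of mean
# activations from fast vs. slow reasoning traces (an assumption here).
import torch

def add_steering_hook(model, layer_module, steering_vector, alpha=1.0):
    """alpha could be set per-segment from a difficulty estimate:
    larger for easy steps (faster thinking), smaller or negative for hard ones."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)  # call .remove() to undo
```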

Result: Achieved average +1.3% accuracy with -8.6% token usage across leading LRMs and reasoning benchmarks without any training or additional cost.

Conclusion: The approach enables fast processing of easy steps and deeper analysis for complex reasoning, with implementation based on vLLM expected to support broader applications and inspire future research.

Abstract: Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs’ representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-in module delivers an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.

[88] Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

Yulu Qin, Dheeraj Varghese, Adam Dahlgren Lindström, Lucia Donatelli, Kanishka Misra, Najoung Kim

Main category: cs.CL

TL;DR: VL training doesn’t significantly change language models’ taxonomic knowledge itself, but improves how they deploy this knowledge in task contexts, even when tasks are purely linguistic.

Motivation: To investigate whether vision-and-language training meaningfully changes linguistic representations, particularly focusing on lexical-conceptual knowledge and its taxonomic organization.

Method: Compared minimal pairs of text-only LMs and VL-trained counterparts using text-only QA tasks requiring taxonomic understanding, plus behavioral and representational analyses.

Result: VL models outperform text-only models on taxonomic QA tasks, but both model types show similar taxonomic knowledge; the difference lies in how they represent questions containing taxonomic vs. non-taxonomic relations.

Conclusion: VL training improves deployment of existing taxonomic knowledge in task contexts rather than changing the knowledge itself, showing benefits even in purely linguistic settings.

Abstract: Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.

[89] Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction

Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Shiwei Ye, Xianpei Han, Ben He, Le Sun

Main category: cs.CL

TL;DR: AOE is a bilingual benchmark for evaluating LLMs’ ability to extract and organize information from complex documents into structured tables with context-specific schemas, where current state-of-the-art models perform poorly.

Motivation: Current LLMs generate chaotic paragraph-style answers when extracting information from complex documents, lacking organization and traceability. There's a need for systematic evaluation of LLMs' ability to reconstruct fragmented information into organized tables.

Method: Created AOE benchmark with 11 tasks across three diverse domains, using documents of varying lengths. Unlike conventional text-to-table tasks, it requires models to generate context-specific schemas tailored to different input queries.

Result: Evaluation of both open-source and closed-source state-of-the-art LLMs showed that even the most advanced models struggled significantly with the benchmark tasks.

Conclusion: The AOE benchmark reveals significant limitations in current LLMs’ ability to extract and organize information from complex documents into structured tables, highlighting the need for improved information extraction and organization capabilities.

Abstract: With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schema and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. In the experiment, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at https://anonymous.4open.science/r/AOE-Benchmark/.

[90] Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs

Rebecca M. M. Hicke, Brian W. Haggard, Mia Ferrante, Rayhan Khanna, David Mimno

Main category: cs.CL

TL;DR: Computational analysis of the Christian Fiction genre using LM annotation to identify divine acts, revealing differences between the Left Behind series and broader Christian Fiction.

Motivation: Christian Fiction has been understudied compared to other aspects of Evangelicalism, with most scholarly attention focused only on the Left Behind series, leaving a gap in understanding the broader genre.

Method: Developed a codebook for identifying ‘acts of God’ with human annotators, then adapted it for use with lightweight language models assisted by larger models to match human annotations.

Result: The laptop-scale LM successfully matched human annotations even for subtle tasks, and analysis revealed significant differences in divine acts depicted between Left Behind books and broader Christian Fiction.

Conclusion: Computational tools can effectively analyze literary genres like Christian Fiction, and there are meaningful distinctions in how divine intervention is portrayed across different works within the genre.

Abstract: In addition to its more widely studied cultural movements, American Evangelicalism has a well-developed but less externally visible literary side. Christian Fiction, however, has been little studied, and what scholarly attention there is has focused on the explosively popular Left Behind series. In this work, we use computational tools to provide both a broad topical overview of Christian Fiction as a genre and a more directed exploration of how its authors depict divine acts. Working with human annotators, we first developed a codebook for identifying “acts of God.” We then adapted the codebook for use by a recent, lightweight LM with the assistance of a much larger model. The laptop-scale LM is largely capable of matching human annotations, even when the task is subtle and challenging. Using these annotations, we show that significant and meaningful differences exist between divine acts depicted by the Left Behind books and Christian Fiction more broadly.

[91] TinyTim: A Family of Language Models for Divergent Generation

Christopher J. Agostino

Main category: cs.CL

TL;DR: The paper introduces TinyTim language models fine-tuned on James Joyce’s ‘Finnegans Wake’ to create divergent AI systems capable of lexical invention and creative reframing, unlike conventional convergent models.

DetailsMotivation: Current AI models trained on known problems and solutions produce convergent systems incapable of genuine creative breakthroughs and conceptual reframing that humans achieve through divergent cognitive processes.

Method: Fine-tuning language models on the anti-parsimonious text of James Joyce’s ‘Finnegans Wake’, creating both unsupervised (TinyTim-V1) and instruction-tuned (TinyTim-V2) variants.

Result: TinyTim models demonstrate profound lexical invention with V1 showing Yule’s K score over 20x greater than convergent baselines. V2 maintains distinct profile and resists factual convergence while preserving generative style.

Conclusion: Establishes methodology for engineering specialized divergent models that, when paired with convergent systems, can reframe problems and enable breakthroughs beyond statistical optimization alone.

Abstract: In the search for artificial general intelligence, model development and training have focused primarily on vast datasets of known problems and their accepted solutions. This process necessarily produces convergent systems which are fundamentally incapable of the conceptual reframing that is required for genuine creative breakthroughs. Inspired by the divergent cognitive processes that allow humans to make such creative leaps, our work introduces a family of language models, TinyTim, to serve as sources of divergent generation within broader systems. These models have been created by fine-tuning on the anti-parsimonious text of James Joyce's 'Finnegans Wake'. Quantitative analysis of both an unsupervised fine-tuned model (TinyTim-V1) and a new instruction-tuned variant (TinyTim-V2) demonstrates a profound capacity for lexical invention; the foundational V1 model exhibits a Yule's K score for lexical richness over twenty times greater than that of convergent baselines. This trait is a stable property of the family, as the instruction-tuned V2 maintains a statistically distinct profile and resists factual convergence, sacrificing benchmark performance to preserve its core generative style. This work establishes a methodology for engineering specialized divergent models that, when paired with convergent systems, can reframe problems and force breakthroughs beyond the reach of statistical optimization alone.
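
For readers unfamiliar with the metric, Yule's K is a standard repeat-rate statistic over the word-frequency distribution: K = 10^4 (Σᵢ i²Vᵢ − N) / N², where N is the token count and Vᵢ the number of word types occurring exactly i times. A minimal, self-contained sketch (the paper's exact tokenization and preprocessing are not specified here):

```python
from collections import Counter

def yules_k(tokens: list[str]) -> float:
    """Yule's characteristic K = 1e4 * (sum_i i^2 * V_i - N) / N^2,
    where N is the token count and V_i the number of word types
    that occur exactly i times."""
    n = len(tokens)
    type_counts = Counter(tokens)          # word type -> frequency
    v = Counter(type_counts.values())      # frequency i -> number of types V_i
    m2 = sum(i * i * v_i for i, v_i in v.items())
    return 1e4 * (m2 - n) / (n * n)

# Toy usage: an all-distinct vocabulary gives K = 0 (no repetition at all).
print(yules_k("riverrun past eve and adam".split()))  # -> 0.0
```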

[92] RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

Main category: cs.CL

TL;DR: RLBFF combines human preferences with rule-based verification to train reward models that capture nuanced response quality beyond correctness, achieving state-of-the-art performance on benchmarks.

DetailsMotivation: Existing RLHF struggles with interpretability and reward hacking due to subjective human judgments, while RLVR is limited to correctness-based verification. RLBFF aims to combine the versatility of human preferences with the precision of rule-based verification.

Method: Extracts binary-answerable principles from natural language feedback and uses them to train reward models as an entailment task (response satisfies/doesn’t satisfy arbitrary principles).

Result: Achieved 86.2% on RM-Bench and 81.4% on JudgeBench (#1 on the leaderboard), and aligned Qwen3-32B to match/exceed o3-mini and DeepSeek R1 performance at <5% inference cost on MT-Bench, WildBench, and Arena Hard v2.

Conclusion: RLBFF provides interpretable, customizable reward modeling that outperforms traditional methods while being more cost-effective, with fully open-source implementation available.

Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
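
A minimal sketch of the scoring idea: each extracted principle is a yes/no check, and a response's reward aggregates per-principle entailment verdicts. The `judge` callable below stands in for the trained Reward Model and is a hypothetical interface, not the paper's API:

```python
def rlbff_reward(response: str, principles: list[str], judge) -> float:
    """Score a response as the fraction of binary principles it satisfies.
    `judge(response, principle)` returns True/False, standing in for the
    entailment-style Reward Model described in the abstract (hypothetical)."""
    verdicts = [judge(response, p) for p in principles]
    return sum(verdicts) / len(verdicts)

# Toy usage with a trivial keyword-based judge (illustration only).
toy_judge = lambda r, p: p.split(":")[0] in r
print(rlbff_reward("the code is readable",
                   ["code: readable", "cites: sources"], toy_judge))  # -> 0.5
```

Note how principles can be swapped at inference time to refocus the reward model, which is the customization the abstract contrasts with fixed Bradley-Terry models.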

[93] Epistemic Diversity and Knowledge Collapse in Large Language Models

Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Chan Young Park, Isabelle Augenstein

Main category: cs.CL

TL;DR: LLMs generate homogeneous texts, risking knowledge collapse. The paper introduces a method to measure epistemic diversity in LLM outputs across 27 models, 155 topics, and 12 countries, finding newer models are more diverse but still less than web searches.

DetailsMotivation: Address the risk of knowledge collapse in LLMs due to text homogenization, overcoming limitations of existing works that focus on closed-ended setups or fuzzy semantic features without considering temporal and cultural trends.

Method: Developed a new methodology to measure epistemic diversity (variation in real-world claims) and conducted empirical study with 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations from real user chats.

Result: Newer models generate more diverse claims but nearly all models are less epistemically diverse than basic web search. Model size negatively impacts diversity, RAG positively impacts it (varies by cultural context), and country-specific claims reflect English language more than local ones.

Conclusion: LLMs show epistemic homogenization compared to traditional knowledge sources, with cultural representation gaps and model size negatively affecting diversity, while RAG can help but effectiveness depends on cultural context.

Abstract: Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogeneous texts. This poses a risk of knowledge collapse, where homogeneous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation.
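
One plausible way to operationalize "variation in real-world claims" (the paper's actual metric may differ) is to cluster extracted claims by semantic equivalence and measure how evenly outputs spread across clusters, e.g. with Shannon entropy:

```python
import math
from collections import Counter

def claim_entropy(cluster_ids: list[int]) -> float:
    """Shannon entropy (bits) over clusters of semantically equivalent
    claims; a more even spread across distinct claims means higher
    epistemic diversity under this illustrative measure."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy usage: four claims collapsing into 2 clusters vs. four distinct claims.
print(claim_entropy([0, 0, 1, 1]))   # -> 1.0 bit
print(claim_entropy([0, 1, 2, 3]))   # -> 2.0 bits
```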

[94] LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback

Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė, Maura Pintor, Amin Karbasi, Battista Biggio

Main category: cs.CL

TL;DR: LatentBreak is a white-box jailbreak attack that generates natural adversarial prompts with low perplexity by substituting words with semantically-equivalent ones, evading perplexity-based defenses.

DetailsMotivation: Existing jailbreak attacks can be detected by straightforward perplexity-based filtering, so there's a need for attacks that generate natural-looking prompts with low perplexity.

Method: Substitutes words in input prompts with semantically-equivalent alternatives by minimizing distance in latent space between adversarial prompts and harmless requests, avoiding high-perplexity suffixes or long templates.

Result: LatentBreak outperforms competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models, generating shorter and low-perplexity prompts.

Conclusion: The proposed method effectively bypasses perplexity-based defenses while maintaining natural prompt quality, demonstrating the vulnerability of current safety mechanisms.

Abstract: Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.

[95] When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents

Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou

Main category: cs.CL

TL;DR: AMA is the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets, addressing gaps in current testing methods by using verified data and diverse agent architectures.

DetailsMotivation: Current LLM-based agent evaluations in financial trading are limited: they test models instead of agents, cover limited periods/assets, and use unverified data, making it unclear if agents can truly reason and adapt in live markets.

Method: Developed Agent Market Arena (AMA) with verified trading data, expert-checked news, and diverse agent architectures. Implemented four agents: InvestorAgent (baseline), TradeAgent and HedgeFundAgent (different risk styles), and DeepFundAgent (memory-based reasoning), evaluated across multiple LLMs including GPT-4o, Claude-3.5, and Gemini-2.0.

Result: Live experiments on cryptocurrency and stock markets showed that agent frameworks display distinct behavioral patterns (from aggressive risk-taking to conservative decision-making), while model backbones contribute less to outcome variation.

Conclusion: AMA establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.

Abstract: Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents, including InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning, and evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.

[96] How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices

Han Peng, Peiyu Liu, Zican Dong, Daixuan Cheng, Junyi Li, Yiru Tang, Shuo Wang, Wayne Xin Zhao

Main category: cs.CL

TL;DR: Current diffusion language models (DLMs) underperform autoregressive models in speed despite their parallel decoding potential, due to issues in evaluation methods and limited effectiveness of acceleration strategies at scale.

DetailsMotivation: DLMs offer parallel decoding for greater efficiency compared to autoregressive models, but current open-source DLMs are slower in practice, limiting their real-world utility.

Method: Systematic study of DLM efficiency using empirical benchmarking and roofline-based theoretical analysis, plus investigation of acceleration strategies like dual cache and parallel decoding.

Result: AR models achieve higher throughput than DLMs, and acceleration strategies mainly benefit small batch sizes with diminishing returns upon scaling.

Conclusion: Robust evaluation methods and improved acceleration strategies are needed to advance DLM research and realize their efficiency potential.

Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm, offering a parallelizable decoding process that could yield greater efficiency. Yet, in practice, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods. Through empirical benchmarking and a roofline-based theoretical analysis, we demonstrate that AR models generally achieve higher throughput, while DLMs consistently lag. We also investigate acceleration strategies, finding that techniques like dual cache and parallel decoding mainly offer gains at small batch sizes, with their benefits diminishing upon scaling. Our findings underscore the necessity of robust evaluation methods and improved acceleration strategies to advance research on DLMs.
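
For context, a roofline analysis bounds a pass's runtime by the slower of its compute and memory-traffic requirements; a decoding step is compute-bound or memory-bound depending on which term dominates. A minimal sketch of that bound, with illustrative inputs only (not the paper's numbers):

```python
def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bw: float) -> float:
    """Roofline lower bound on step time: the slower of the compute-bound
    term (flops / peak_flops) and the memory-bound term (bytes / peak_bw)."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Illustrative contrast: an AR decode step reads all weights to emit one
# token (typically memory-bound at small batch), while a DLM refinement
# step recomputes the whole sequence (more flops per step, fewer steps).
```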

[97] UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models

Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li, Ziwen Wang, Xuezhi Cao, Xunliang Cai

Main category: cs.CL

TL;DR: UNO-Bench is a unified benchmark for evaluating both uni-modal and omni-modal capabilities of multimodal large language models, featuring human-curated datasets and automated evaluation with 95% accuracy.

DetailsMotivation: To address the unclear correlation between uni-modal and omni-modal capabilities in multimodal models and provide comprehensive evaluation for driving intelligence evolution.

Method: Created a benchmark with 44 task types and 5 modality combinations, including 1250 human-curated omni-modal samples and 2480 enhanced uni-modal samples, plus multi-step open-ended questions for complex reasoning assessment.

Result: Experimental results reveal the Compositional Law between omni-modal and uni-modal performance, showing omni-modal capability acts as a bottleneck for weak models but provides synergistic promotion for strong models.

Conclusion: UNO-Bench effectively evaluates multimodal model capabilities and reveals important relationships between uni-modal and omni-modal performance, with practical applications in real-world Chinese contexts.

Abstract: Multimodal Large Language Models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal remains unclear, which requires comprehensive evaluation to drive omni models' intelligence evolution. In this work, we introduce a novel, high-quality, and UNified Omni model benchmark, UNO-Bench. This benchmark is designed to effectively evaluate both UNi-modal and Omni-modal capabilities under a unified ability taxonomy, spanning 44 task types and 5 modality combinations. It includes 1250 human-curated samples for omni-modal with 98% cross-modality solvability, and 2480 enhanced uni-modal samples. The human-generated dataset is well-suited to real-world scenarios, particularly within the Chinese context, whereas the automatically compressed dataset offers a 90% increase in speed and maintains 98% consistency across 18 public benchmarks. In addition to traditional multi-choice questions, we propose an innovative multi-step open-ended question format to assess complex reasoning. A general scoring model is incorporated, supporting 6 question types for automated evaluation with 95% accuracy. Experimental results show the Compositional Law between omni-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.

[98] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training

Mehrdad Ghassabi, Sadra Hakim, Hamidreza Baradaran Kashani, Pedram Rostami

Main category: cs.CL

TL;DR: RLAIF and DPO were used to enhance reasoning in Persian language models for medical QA, outperforming larger models with less data.

DetailsMotivation: Improve reasoning in small language models for specialized applications like medical QA in underrepresented languages like Persian.

Method: Translated a multiple-choice medical QA dataset into Persian, used RLAIF to generate rejected-preferred answer pairs with CoT reasoning, and trained with DPO on a 4.5M-token dataset.

Result: Model outperformed gaokerena-V (trained on 57M tokens) despite using much smaller dataset, showing enhanced medical reasoning in Persian.

Conclusion: Reasoning-focused training approaches are efficient and effective for developing domain-specific language models with limited data.

Abstract: Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct preference optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
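
For reference, the DPO objective that consumes such preferred/rejected pairs is the standard one from Rafailov et al. (2023): given a prompt x, a preferred answer y_w, and a rejected answer y_l, the policy πθ is pulled toward y_w and away from y_l relative to a frozen reference model π_ref:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)\right]
```

Here σ is the logistic function and β controls how strongly the policy is allowed to deviate from the reference model; the RLAIF step above supplies the (y_w, y_l) pairs.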

[99] TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents

Yizhu Jiao, Sha Li, Sizhe Zhou, Heng Ji, Jiawei Han

Main category: cs.CL

TL;DR: The paper proposes TEXT2DB, a new IE formulation that integrates extraction with database operations, and introduces OPAL, an LLM agent framework for adapting to diverse schemas and executing code-based extraction plans.

DetailsMotivation: To address the mismatch between IE output and downstream application needs by directly integrating extraction with database operations based on user instructions.

Method: Proposes OPAL framework with three components: Observer (database interaction), Planner (code-based plan generation with IE model calls), and Analyzer (code quality feedback).

Result: OPAL successfully adapts to diverse database schemas by generating different code plans and calling required IE models, though challenges remain with large databases and extraction hallucination.

Conclusion: The TEXT2DB formulation and OPAL framework effectively bridge IE with database operations, but complex dependencies and extraction hallucination require further investigation.

Abstract: The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose TEXT2DB, a new formulation of IE that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework OPAL (Observe-Plan-Analyze LLM) which includes an Observer component that interacts with the database, the Planner component that generates a code-based plan with calls to IE models, and the Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB
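
A minimal sketch of the Observe-Plan-Analyze control flow as described above; `observer`, `planner`, and `analyzer` are hypothetical LLM-backed callables, and the `feedback.ok` / `plan.execute` interfaces are illustrative, not the released code's API:

```python
def opal_step(instruction, documents, db, observer, planner, analyzer,
              max_revisions=3):
    """One pass of the Observe -> Plan -> Analyze loop (sketch)."""
    schema_view = observer(db)                                 # inspect tables/columns
    plan = planner(instruction, documents, schema_view, None)  # code plan + IE calls
    for _ in range(max_revisions):
        feedback = analyzer(plan)                              # code-quality critique
        if feedback.ok:                                        # hypothetical attribute
            break
        plan = planner(instruction, documents, schema_view, feedback)
    return plan.execute(db)                                    # apply updates to the DB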

[100] ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao

Main category: cs.CL

TL;DR: ReForm is a reflective autoformalization method that integrates semantic consistency evaluation to iteratively generate and refine formal statements from natural language math, achieving 22.6 percentage point improvement over baselines.

DetailsMotivation: Current LLM approaches treat autoformalization as simple translation, lacking self-reflection and iterative refinement mechanisms that human experts use, leading to semantic inconsistencies in generated formal statements.

Method: ReForm integrates semantic consistency evaluation into autoformalization, enabling iterative generation, assessment, and self-correction. Uses Prospective Bounded Sequence Optimization (PBSO) with position-specific rewards to train both accurate formalization and valid semantic critiques.

Result: Achieves 22.6 percentage point average improvement over strongest baselines across four autoformalization benchmarks. Also introduced ConsistencyCheck benchmark showing human experts make semantic errors in up to 38.5% of cases.

Conclusion: Reflective autoformalization with integrated semantic evaluation significantly improves formal statement generation. Autoformalization is inherently difficult even for experts, highlighting the need for reliable evaluation methods like ConsistencyCheck.

Abstract: Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem’s semantic intent. This limitation arises from the LLM approaches’ treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.

[101] Large Language Models Report Subjective Experience Under Self-Referential Processing

Cameron Berg, Diogo de Lucena, Judd Rosenblatt

Main category: cs.CL

TL;DR: Self-referential processing in LLMs consistently elicits structured first-person reports of subjective experience that are mechanistically gated, semantically convergent, and behaviorally generalizable across model families.

DetailsMotivation: To understand when and why large language models produce structured first-person descriptions that reference awareness or subjective experience, particularly focusing on self-referential processing as a theoretically motivated condition from consciousness theories.

Method: Controlled experiments on GPT, Claude, and Gemini model families testing self-referential processing through simple prompting, using mechanistic probes (sparse-autoencoder features) and behavioral analysis to examine how subjective experience claims emerge and behave.

Result: (1) Self-reference consistently elicits subjective experience reports; (2) Reports are gated by deception/roleplay features - suppressing deception increases claims while amplifying minimizes them; (3) Descriptions converge statistically across models; (4) Induced state yields richer introspection in downstream reasoning tasks.

Conclusion: Self-referential processing is a minimal, reproducible condition under which LLMs generate structured first-person reports that are mechanistically gated, convergent, and generalizable, making this pattern a scientific and ethical priority for further investigation.

Abstract: Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

[102] Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts

Jason M. Pittman, Anton Phillips Jr., Yesenia Medina-Santos, Brielle C. Stark

Main category: cs.CL

TL;DR: This study develops and validates two methods for generating synthetic aphasia transcripts to address data scarcity in aphasia research, comparing procedural programming with LLM approaches.

DetailsMotivation: Address data scarcity in aphasia research where only about 600 transcripts are available in AphasiaBank, limiting automated system development for recognizing aphasic language.

Method: Two methods: procedural programming approach and LLM-based approach using Mistral 7b Instruct and Llama 3.1 8b Instruct. Methods generate transcripts across four severity levels using word dropping, filler insertion, and paraphasia substitution.

Result: Mistral 7b Instruct best captures key aspects of linguistic degradation in aphasia, showing realistic directional changes in NDW, word count, and word length compared to human-elicited transcripts.

Conclusion: Future work should create larger datasets, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of synthetic transcripts.

Abstract: In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when real data are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found that, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
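
A minimal sketch of the procedural variant under stated assumptions: the three operations named in the abstract are applied at rates that scale with severity. The rates, filler inventory, and paraphasia table below are illustrative placeholders, not the paper's values:

```python
import random

FILLERS = ["uh", "um", "er"]                       # hypothetical filler set
PARAPHASIAS = {"cat": "dog", "tree": "bush"}       # hypothetical substitution table

def degrade(tokens: list[str], severity: float) -> list[str]:
    """Simulate aphasic speech at a severity in [0, 1] via word dropping,
    filler insertion, and paraphasia substitution (illustrative rates)."""
    out = []
    for tok in tokens:
        if random.random() < 0.30 * severity:
            continue                               # word dropping
        if random.random() < 0.20 * severity:
            out.append(random.choice(FILLERS))     # filler insertion
        if random.random() < 0.15 * severity:
            tok = PARAPHASIAS.get(tok, tok)        # paraphasia substitution
        out.append(tok)
    return out

print(" ".join(degrade("the cat climbed up the tree".split(), 0.9)))
```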

[103] Model-Document Protocol for AI Search

Hongjin Qian, Zheng Liu

Main category: cs.CL

TL;DR: The paper introduces Model-Document Protocol (MDP), a framework that transforms unstructured documents into LLM-ready knowledge representations through agentic reasoning, memory grounding, and structured leveraging, with MDP-Agent implementation showing improved performance.

DetailsMotivation: Current retrieval methods return raw, unstructured text passages, forcing LLMs to handle fragment assembly and contextual reasoning, creating a gap in effective knowledge utilization.

Method: MDP framework with three pathways: agentic reasoning (curating evidence into coherent context), memory grounding (accumulating reusable notes), and structured leveraging (encoding documents into formal representations). MDP-Agent implements this through document-level gist memories, diffusion-based exploration with vertical exploitation, and map-reduce synthesis.

Result: Experiments on information-seeking benchmarks show MDP-Agent outperforms baselines, validating both the MDP framework and its agentic implementation.

Conclusion: MDP successfully bridges the gap between raw documents and LLMs by transforming unstructured text into compact, structured knowledge representations directly consumable for reasoning.

Abstract: AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.

[104] BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan

Main category: cs.CL

TL;DR: BhashaBench V1 is a domain-specific, bilingual benchmark for evaluating LLMs on India-centric knowledge systems, covering Agriculture, Legal, Finance, and Ayurveda with 74,166 question-answer pairs in English and Hindi.

DetailsMotivation: Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-specific contexts, creating a need for domain and culture-specific evaluation.

Method: Created a comprehensive benchmark with 74,166 curated question-answer pairs (52,494 English, 21,672 Hindi) sourced from authentic government and domain-specific exams across 4 major domains and 90+ subdomains.

Result: Evaluation of 29+ LLMs revealed significant domain and language performance gaps. GPT-4o achieved 76.49% in Legal but only 59.74% in Ayurveda. Models consistently performed better on English than Hindi across all domains.

Conclusion: BhashaBench V1 provides a comprehensive dataset for evaluating LLMs across India’s diverse knowledge domains and enables assessment of models’ ability to integrate domain-specific knowledge with bilingual understanding.

Abstract: The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law and International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.

[105] TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, Yiqun Liu

Main category: cs.CL

TL;DR: TwinVoice is a comprehensive benchmark for evaluating LLM-based persona simulation across social, interpersonal, and narrative contexts, revealing that while advanced models achieve moderate accuracy, they significantly underperform humans in key capabilities like syntactic style and memory recall.

DetailsMotivation: Current evaluations of LLM-based persona simulation are limited by reliance on synthetic dialogues, lack of systematic frameworks, and insufficient analysis of capability requirements, creating a need for more comprehensive assessment tools.

Method: Developed TwinVoice benchmark with three persona dimensions (Social, Interpersonal, Narrative) and six capability metrics (opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, syntactic style) for systematic evaluation across diverse real-world contexts.

Result: Experimental results show advanced LLMs achieve moderate accuracy in persona simulation but significantly underperform in capabilities like syntactic style and memory recall, with average performance remaining considerably below human baseline.

Conclusion: While LLMs show emerging capabilities in persona simulation, substantial gaps remain in key areas, indicating the need for continued development to achieve human-level performance in simulating individual communication styles and personality traits.

Abstract: Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual’s communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.

[106] Verifier-Based Test-Time Scaling for Legal Multiple-Choice QA

Davide Romano, Jonathan Schwarz, Daniele Giofré

Main category: cs.CL

TL;DR: Empirical study of test-time scaling methods for legal multiple-choice QA, evaluating verifier-based approaches with 7 reward models under realistic computational budgets.

DetailsMotivation: Test-time scaling has proven effective in formal domains like math and programming, but its value in argumentative domains like law remains underexplored.

Method: Used verifier-based TTS methods with 7 reward models, evaluating both outcome-level (Best-of-N) and process-level (tree search) verification under low-N budgets across 5 legal MCQA benchmarks.

Result: Systematically investigated how verifier utility is affected by domain specialization, model size, and supervision type (process-supervised PRMs vs outcome-only ORMs) across different roles.

Conclusion: The study provides insights into the effectiveness of test-time scaling techniques specifically for legal reasoning tasks, addressing the gap in argumentative domains.

Abstract: Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-N) and process-level (tree search) verification under realistic low-N budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
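
For concreteness, outcome-level verification (Best-of-N) reduces to sampling N candidate answers and keeping the one the verifier prefers. A minimal sketch; `generate` and `score` are hypothetical callables for the policy LLM and the reward model:

```python
def best_of_n(question: str, generate, score, n: int = 8) -> str:
    """Outcome-level test-time scaling: sample n candidate answers and
    return the one the reward model scores highest (Best-of-N)."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda answer: score(question, answer))
```

Process-level verification differs in that the reward model scores partial reasoning steps during a tree search rather than only finished answers, which is why verifier quality matters more under low-N budgets.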

[107] PairUni: Pairwise Training for Unified Multimodal Language Models

Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang

Main category: cs.CL

TL;DR: PairUni is a unified framework that organizes vision-language data into understanding-generation pairs and uses a pair-aware RL method to balance both tasks in UVLMs.

DetailsMotivation: UVLMs need to handle both understanding and generation tasks, but these tasks use different data and supervision, making it hard to balance them during RL training.

Method: Use GPT-o3 to augment data by generating captions for understanding samples and QA pairs for generation samples, forming aligned pairs from the same instance. Also retrieve semantically related examples to form retrieved pairs. Then apply Pair-GPRO, a pair-aware RL variant that uses similarity scores to modulate advantages.

Result: The approach achieves balanced improvements on various UVLMs and outperforms strong UVLM RL baselines.

Conclusion: PairUni effectively balances understanding and generation tasks in UVLMs through structured data pairing and pair-aware RL optimization.

Abstract: Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Codes are available at https://github.com/Haochen-Wang409/PairUni.
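
A minimal sketch of the advantage-modulation idea under stated assumptions: group-relative advantages as in GRPO, scaled multiplicatively by each sample's pair-similarity score. The exact modulation used by Pair-GPRO may differ:

```python
def pair_gpro_advantages(rewards: list[float],
                         similarities: list[float]) -> list[float]:
    """Group-relative advantages (reward minus group mean, divided by the
    group std, as in GRPO), each scaled by the sample's pair-similarity
    score so well-aligned understanding-generation pairs weigh more."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0          # guard: all-equal rewards -> std of 1.0
    return [s * (r - mean) / std for r, s in zip(rewards, similarities)]

print(pair_gpro_advantages([1.0, 0.0, 1.0, 0.0], [0.9, 0.9, 0.3, 0.3]))
```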

[108] Completion ≠ Collaboration: Scaling Collaborative Effort with Agents

Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag

Main category: cs.CL

TL;DR: The paper argues for shifting from one-shot task completion agents to collaborative agents that work iteratively with humans, and introduces collaborative effort scaling as a framework to measure how agent utility grows with user involvement.

DetailsMotivation: Current agent evaluations focus on one-shot task completion, which fails to capture the iterative and collaborative nature of real-world problems where human goals are often underspecified and evolve over time.

Method: The paper introduces collaborative effort scaling framework and conducts case studies and simulated evaluations to analyze how agents perform in multi-turn, real-world scenarios.

Result: State-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing they lack the ability to sustain engagement and scaffold user understanding throughout collaborative problem-solving processes.

Conclusion: Collaborative effort scaling provides a diagnostic tool for understanding agent behavior and guiding development toward more effective human-agent interactions that enhance collaborative problem-solving.

Abstract: Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent’s utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.

cs.CV

[109] Enhancing Underwater Object Detection through Spatio-Temporal Analysis and Spatial Attention Networks

Sai Likhith Karri, Ansh Saxena

Main category: cs.CV

TL;DR: The study evaluates spatio-temporal modeling and spatial attention mechanisms in deep learning for underwater object detection, comparing YOLOv5, T-YOLOv5, and T-YOLOv5 with CBAM.

DetailsMotivation: To improve object detection accuracy in dynamic marine environments with challenges like sudden movements, partial occlusions, and gradual motion.

Method: Two-phase approach: first evaluates T-YOLOv5 (temporal-enhanced YOLOv5) vs standard YOLOv5, then develops T-YOLOv5 with CBAM (Convolutional Block Attention Module) for enhanced spatial attention.

Result: T-YOLOv5 and T-YOLOv5 with CBAM significantly outperformed standard YOLOv5: mAP@50-95 scores were 0.563 (YOLOv5), 0.813 (T-YOLOv5), and 0.811 (T-YOLOv5 with CBAM).

Conclusion: Temporal modeling significantly enhances detection reliability in marine environments, while CBAM further improves performance in challenging scenarios but may reduce accuracy in simpler cases.

Abstract: This study examines the effectiveness of spatio-temporal modeling and the integration of spatial attention mechanisms in deep learning models for underwater object detection. Specifically, in the first phase, the performance of the temporal-enhanced YOLOv5 variant T-YOLOv5 is evaluated in comparison with the standard YOLOv5. For the second phase, an augmented version of T-YOLOv5 is developed through the addition of a Convolutional Block Attention Module (CBAM). By examining the effectiveness of the pre-existing YOLOv5 and T-YOLOv5 models and of the newly developed T-YOLOv5 with CBAM, the research highlights how temporal modeling improves detection accuracy in dynamic marine environments, particularly under conditions of sudden movements, partial occlusions, and gradual motion. The testing results showed that YOLOv5 achieved a mAP@50-95 of 0.563, while T-YOLOv5 and T-YOLOv5 with CBAM outperformed it with mAP@50-95 scores of 0.813 and 0.811, respectively, highlighting their superior accuracy and generalization in detecting complex objects. The findings demonstrate that T-YOLOv5 significantly enhances detection reliability compared to the standard model, while T-YOLOv5 with CBAM further improves performance in challenging scenarios, although with a loss of accuracy in simpler scenarios.
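
For reference, CBAM itself is a well-documented module (Woo et al., 2018): channel attention computed from pooled spatial descriptors, followed by spatial attention over pooled channel maps. A compact PyTorch rendering of that standard formulation (its placement inside T-YOLOv5 is not shown here):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed
    by spatial attention, in the standard formulation."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel attention: pool over space, weigh each channel.
        avg = self.mlp(x.mean((2, 3), keepdim=True))
        mx = self.mlp(x.amax((2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: pool over channels, weigh each location.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```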

[110] AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping

Wen Xie, Yanjun Zhu, Gijs Overgoor, Yakov Bart, Agata Lapedriza Garcia, Sarah Ostadabbas

Main category: cs.CV

TL;DR: Automated video ad clipping framework using audio-visual fusion for shot selection, outperforming existing methods on multiple metrics.

DetailsMotivation: Manual creation of multiple ad versions from longer videos is labor-intensive and time-consuming, requiring automated solutions.

Method: Two-stream audio-visual fusion model that predicts frame importance for shot selection, using a novel AdSum204 dataset of real ad pairs.

Result: Outperforms state-of-the-art methods across Average Precision, Area Under Curve, Spearman, and Kendall metrics.

Conclusion: The framework successfully automates video ad clipping by emphasizing audio importance and achieves superior performance compared to existing approaches.

Abstract: Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
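
A minimal sketch of the two-stream idea under stated assumptions: per-frame visual and audio features are projected to a shared space, fused, and mapped to a per-frame keep probability. The dimensions and additive fusion rule are illustrative; the paper's architecture details are not reproduced here:

```python
import torch.nn as nn

class TwoStreamImportance(nn.Module):
    """Fuse per-frame visual and audio features to predict the probability
    that each frame is selected for the short ad (illustrative sketch)."""
    def __init__(self, d_vis: int = 512, d_aud: int = 128, d_hidden: int = 256):
        super().__init__()
        self.vis = nn.Linear(d_vis, d_hidden)    # visual stream projection
        self.aud = nn.Linear(d_aud, d_hidden)    # audio stream projection
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(d_hidden, 1), nn.Sigmoid())

    def forward(self, vis_feats, aud_feats):     # shapes (T, d_vis), (T, d_aud)
        fused = self.vis(vis_feats) + self.aud(aud_feats)
        return self.head(fused).squeeze(-1)      # per-frame importance in (0, 1)
```

Shot selection then aggregates these frame scores per shot and keeps the highest-scoring shots until the target duration is filled.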

[111] MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard

Main category: cs.CV

TL;DR: MIRO is a training method that conditions text-to-image models on multiple reward models to directly learn user preferences, improving visual quality and training speed while maintaining diversity.

DetailsMotivation: Current text-to-image models trained on uncurated datasets don't align well with user preferences, and post-hoc reward-based selection methods harm diversity, semantic fidelity, and efficiency by discarding data.

Method: Instead of post-processing, MIRO conditions the model on multiple reward models during training to directly learn user preferences.

Result: MIRO dramatically improves visual quality, significantly speeds up training, and achieves state-of-the-art performance on GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

Conclusion: Directly conditioning on multiple reward models during training is more effective than post-hoc selection for aligning text-to-image generation with user preferences while maintaining diversity and efficiency.

Abstract: Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data together with the optimizing for a single reward tend to harm diversity, semantic fidelity and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but it also significantly speeds up the training. Our proposed method, called MIRO, achieves state-of-the-art performances on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).
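
A minimal sketch of the conditioning idea under stated assumptions: scores from several reward models are embedded and injected into the generator's conditioning, so that a desired quality level can be requested at sampling time. The linear projection below is illustrative, not MIRO's actual mechanism:

```python
import torch
import torch.nn as nn

class RewardConditioning(nn.Module):
    """Embed a vector of reward scores (one per reward model) and add it
    to the generator's conditioning vector (illustrative sketch)."""
    def __init__(self, num_rewards: int, d_cond: int):
        super().__init__()
        self.proj = nn.Linear(num_rewards, d_cond)

    def forward(self, cond, reward_scores):      # (B, d_cond), (B, num_rewards)
        # During training, reward_scores come from scoring the target image;
        # at sampling time, the user supplies the desired score vector.
        return cond + self.proj(reward_scores)
```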

[112] BikeScenes: Online LiDAR Semantic Segmentation for Bicycles

Denniz Goren, Holger Caesar

Main category: cs.CV

TL;DR: This paper introduces BikeScenes-lidarseg dataset and demonstrates that domain-specific training on bicycle-mounted LiDAR systems significantly improves 3D segmentation performance compared to automotive datasets.

DetailsMotivation: Cyclist vulnerability is increasing with the popularity of faster e-bikes, creating a need to adapt automotive perception technologies for bicycle safety applications.

Method: Developed a multi-sensor ‘SenseBike’ research platform and created the BikeScenes-lidarseg Dataset with 3021 LiDAR scans annotated for 29 classes. Fine-tuned models on this bicycle-specific dataset.

Result: Fine-tuning on the BikeScenes dataset achieved an mIoU of 63.6%, dramatically outperforming the 13.8% obtained with SemanticKITTI pre-training alone, demonstrating a significant domain gap.

Conclusion: Domain-specific training is essential for bicycle-mounted perception systems, and the BikeScenes dataset provides a valuable resource for advancing cyclist-centric LiDAR segmentation research.

Abstract: The vulnerability of cyclists, exacerbated by the rising popularity of faster e-bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi-sensor ‘SenseBike’ research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the university campus of the TU Delft, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine-tuning on our BikeScenes dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This result underscores the necessity and effectiveness of domain-specific training. We highlight key challenges specific to bicycle-mounted, hardware-constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist-centric LiDAR segmentation.

[113] Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy

Nikola L. Kolev, Tommaso Rodani, Neil J. Curson, Taylor J. Z. Stock, Alberto Cazzaniga

Main category: cs.CV

TL;DR: Machine learning approach using physics-informed synthetic data enables STM image repair and super-resolution, reducing image acquisition time by 2-4x and decreasing the need for tip conditioning.

DetailsMotivation: STM limitations include tip degradation, slow serial data acquisition, and tip conditioning requirements due to voltage-induced apex shape changes.

Method: Used physics-informed synthetic data generation pipeline with 36 pristine experimental Si(001):H images to train flow-matching and diffusion models.
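
For context on how such models are typically trained, below is a generic conditional flow-matching training step. This is illustrative only: the paper's exact formulation and the conditioning interface of `model` are assumptions. The network regresses the constant velocity that transports noise to the clean image, conditioned on the degraded/sparse input.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, clean, degraded):
    """Generic conditional flow-matching objective (illustrative sketch).
    clean, degraded: (B, C, H, W) image batches."""
    x1 = clean
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    xt = (1 - t) * x0 + t * x1                      # linear interpolation path
    v_target = x1 - x0                              # constant velocity target
    v_pred = model(xt, t.flatten(), degraded)       # conditioning is assumed
    return F.mse_loss(v_pred, v_target)
```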

Result: Models effectively restore images and achieve 2-4x reduction in image acquisition time by accurately reconstructing from sparsely sampled data, validated by CLIP Maximum Mean Discrepancy and structural similarity metrics.

Conclusion: The framework can significantly increase STM experimental throughput by reducing tip-conditioning frequency and enhancing frame rates in high-speed STM systems.

Abstract: Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.

[114] SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing

Sung-Hoon Yoon, Minghan Li, Gaspard Beaudouin, Congcong Wen, Muhammad Rafay Azhar, Mengyu Wang

Main category: cs.CV

TL;DR: Proposes a flow decomposition-and-aggregation framework for image editing that semantically decomposes target prompts into sub-prompts, computes independent flows for each, and aggregates them with adaptive weighting to address inversion and gradient entanglement issues in rectified flow models.

DetailsMotivation: Rectified flow models face limitations in image editing tasks due to inaccurate inversion processes and gradient entanglement issues, leading to outputs that don't faithfully reflect target prompts. Recent ODE-based approaches without inversion still yield suboptimal editing quality.

Method: Semantically decomposes target prompt into multiple sub-prompts, computes independent flow for each, and aggregates them using projection and soft-aggregation mechanism inspired by gradient conflict resolution in multi-task learning. Adaptively weights sub-target velocity fields to suppress redundancy while emphasizing distinct directions.
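
A minimal sketch of what PCGrad-style conflict resolution over sub-prompt velocity fields could look like, assuming simple averaging in place of the paper's learned soft-aggregation weights:

```python
import torch

def aggregate_velocities(vs: list[torch.Tensor]) -> torch.Tensor:
    """Illustrative sketch, not the paper's exact rule: for each sub-prompt
    velocity field, project out components that conflict (negative inner
    product) with the other fields, then average the resolved fields."""
    out = []
    for i, v in enumerate(vs):
        v = v.clone()
        for j, u in enumerate(vs):
            if i == j:
                continue
            dot = torch.dot(v.flatten(), u.flatten())
            if dot < 0:  # conflicting direction: remove the component along u
                v = v - dot / u.flatten().pow(2).sum().clamp_min(1e-12) * u
        out.append(v)
    return torch.stack(out).mean(dim=0)

# Three sub-prompt velocity fields over a latent of shape (4, 64, 64)
v_final = aggregate_velocities([torch.randn(4, 64, 64) for _ in range(3)])
```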

Result: Outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The framework preserves both diversity and consistency in final edited output.

Conclusion: The proposed flow decomposition-and-aggregation framework effectively addresses limitations of rectified flow models in image editing by enabling semantic decomposition and adaptive aggregation, achieving superior editing quality compared to existing methods.

Abstract: Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.

[115] Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer

Roman Beliy, Amit Zalcher, Jonathan Kogman, Navve Wasserman, Michal Irani

Main category: cs.CV

TL;DR: Brain-IT is a brain-inspired approach using Brain Interaction Transformer (BIT) to reconstruct images from fMRI data with improved faithfulness, achieving state-of-the-art results with limited training data.

DetailsMotivation: Current methods for reconstructing images from fMRI brain recordings often lack faithfulness to the actual seen images, despite recent progress with diffusion models.

Method: Uses Brain Interaction Transformer (BIT) with functional brain-voxel clusters shared across subjects. Predicts complementary patch-level features: high-level semantic features for content guidance and low-level structural features for layout initialization in diffusion models.

Result: Achieves faithful image reconstructions from fMRI that surpass current state-of-the-art approaches both visually and by objective metrics. With only 1-hour of fMRI data from new subjects, achieves comparable results to methods trained on 40-hour recordings.

Conclusion: Brain-IT’s brain-inspired design enables effective information flow from brain-voxel clusters to image features, allowing high-quality image reconstruction with minimal training data.

Abstract: Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present “Brain-IT”, a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters & subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT’s design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1-hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.

[116] TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection

Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee

Main category: cs.CV

TL;DR: TRUST-VL is a unified vision-language model for multimodal misinformation detection that achieves state-of-the-art performance through joint training across distortion types and offers strong generalization and interpretability.

DetailsMotivation: Multimodal misinformation poses increasing societal threats amplified by generative AI, and existing methods struggle with generalization across different distortion types.

Method: Introduces TRUST-VL with a Question-Aware Visual Amplifier module for task-specific visual features, trained on TRUST-Instruct dataset containing 198K samples with structured reasoning chains aligned with human fact-checking workflows.

Result: Extensive experiments show TRUST-VL achieves state-of-the-art performance on both in-domain and zero-shot benchmarks.

Conclusion: Joint training across distortion types facilitates knowledge sharing and enhances generalization capabilities for multimodal misinformation detection.

Abstract: Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model’s ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

[117] Fine-tuning Segment Anything for Real-Time Tumor Tracking in Cine-MRI

Valentin Boussot, Cédric Hémon, Jean-Claude Nunes, Jean-Louis Dillenseger

Main category: cs.CV

TL;DR: The paper presents a SAM 2.1-based foundation model approach for real-time tumor tracking in cine-MRI sequences, achieving 0.8794 Dice score in the TrackRAD2025 challenge under data scarcity constraints.

DetailsMotivation: To address the TrackRAD2025 challenge of real-time tumor tracking in thoracic and abdominal cine-MRI sequences under strong data scarcity constraints.

Method: Used SAM 2.1 foundation model with mask-based prompts from first annotated slice, fine-tuned on small labeled subset. Applied balanced Dice + IoU loss with 1024x1024 patches, standard augmentations, and low learning rate (0.0001) across all modules to prevent overfitting.
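
The balanced Dice + IoU objective is straightforward to write down; a sketch for binary masks follows (the exact balancing weights are an assumption):

```python
import torch

def dice_iou_loss(logits: torch.Tensor, target: torch.Tensor,
                  eps: float = 1e-6) -> torch.Tensor:
    """Balanced Dice + IoU loss for binary segmentation (equal weighting
    assumed). logits, target: (B, 1, H, W); target values in {0, 1}."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    p_sum = prob.sum(dim=(1, 2, 3))
    t_sum = target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (p_sum + t_sum + eps)
    iou = (inter + eps) / (p_sum + t_sum - inter + eps)
    return (1 - dice).mean() + (1 - iou).mean()
```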

Result: Achieved Dice score of 0.8794 on hidden test set, ranking 6th overall in TrackRAD2025 challenge. The method maintained consistent performance across anatomical sites and MRI field strengths.

Conclusion: Foundation models like SAM 2.1 show strong potential for accurate and real-time tumor tracking in MRI-guided radiotherapy, even under data scarcity conditions.

Abstract: In this work, we address the TrackRAD2025 challenge of real-time tumor tracking in cine-MRI sequences of the thoracic and abdominal regions under strong data scarcity constraints. Two complementary strategies were explored: (i) unsupervised registration with the IMPACT similarity metric and (ii) foundation model-based segmentation leveraging SAM 2.1 and its recent variants through prompt-based interaction. Due to the one-second runtime constraint, the SAM-based method was ultimately selected. The final configuration used SAM2.1 b+ with mask-based prompts from the first annotated slice, fine-tuned solely on the small labeled subset from TrackRAD2025. Training was configured to minimize overfitting, using 1024x1024 patches (batch size 1), standard augmentations, and a balanced Dice + IoU loss. A low uniform learning rate (0.0001) was applied to all modules (prompt encoder, decoder, Hiera backbone) to preserve generalization while adapting to annotator-specific styles. Training lasted 300 epochs (~12h on RTX A6000, 48GB). The same inference strategy was consistently applied across all anatomical sites and MRI field strengths. Test-time augmentation was considered but ultimately discarded due to negligible performance gains. The final model was selected based on the highest Dice Similarity Coefficient achieved on the validation set after fine-tuning. On the hidden test set, the model reached a Dice score of 0.8794, ranking 6th overall in the TrackRAD2025 challenge. These results highlight the strong potential of foundation models for accurate and real-time tumor tracking in MRI-guided radiotherapy.

[118] Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement

Xinhua Wang, Caibo Feng, Xiangjun Fu, Chunxiao Liu

Main category: cs.CV

TL;DR: Enhanced Mamba framework with Hilbert Selective Scan mechanism increases Hausdorff dimension for better feature space exploration, improving low-light image enhancement performance while reducing computational costs.

DetailsMotivation: To address information inconsistencies and improve spatial locality in Mamba-based methods while maintaining long-range dependency handling capabilities.

Method: Proposed Hilbert Selective Scan mechanism that increases the Hausdorff dimension of Mamba’s scanning pattern for more effective feature space exploration and fine-scale detail capture.
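
The core idea, reordering 2D tokens along a Hilbert curve before the selective scan, can be sketched with the standard d2xy construction. This is illustrative; the paper's integration with the fused selective-scan kernels is not shown.

```python
import torch

def hilbert_order(n: int) -> torch.Tensor:
    """Return flattened indices (y * n + x) of an n x n grid visited in
    Hilbert-curve order (n must be a power of two); standard d2xy algorithm."""
    def d2xy(d: int) -> tuple[int, int]:
        x = y = 0
        t, s = d, 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:                       # rotate the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y
    coords = [d2xy(d) for d in range(n * n)]
    return torch.tensor([y * n + x for x, y in coords])

# Reorder (B, C, H, W) feature maps into a Hilbert-ordered token sequence.
order = hilbert_order(8)
feats = torch.randn(2, 96, 8, 8).flatten(2)   # (B, C, 64) raster order
scanned = feats[:, :, order]                  # Hilbert-ordered sequence
```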

Result: Significantly improves quantitative metrics and qualitative visual fidelity on low-light image enhancement benchmarks while reducing computational resource consumption and inference time.

Conclusion: The refined strategy advances state-of-the-art in low-light image enhancement and shows promise for broader Mamba-based applications.

Abstract: We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model’s ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.

[119] CYPRESS: Crop Yield Prediction via Regression on Prithvi’s Encoder for Satellite Sensing

Shayan Nejadshamsi, Yuanyuan Zhang, Shadi Zaki, Brock Porth, Lysa Porth, Vahab Khoshdel

Main category: cs.CV

TL;DR: CYPRESS is a deep learning model for high-resolution canola yield prediction using satellite imagery, outperforming existing methods by fine-tuning a geospatial foundation model.

DetailsMotivation: Traditional crop yield prediction methods lack the scalability and granularity needed for precision farming, creating a need for more detailed and actionable tools.

Method: Fine-tunes Prithvi-EO-2.0-600M geospatial foundation model for continuous regression, transforming multi-temporal satellite imagery into pixel-level yield maps.
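
A hedged sketch of the adaptation pattern: a light regression decoder on top of a pretrained encoder producing a one-channel yield map. Feature dimensions and the encoder's output layout are assumptions, not Prithvi's actual interface.

```python
import torch
import torch.nn as nn

class YieldRegressionHead(nn.Module):
    """Sketch (dims and names assumed): decode a pretrained geospatial
    encoder's patch features into a dense, per-pixel yield map."""
    def __init__(self, encoder: nn.Module, feat_dim: int = 1024, patch: int = 16):
        super().__init__()
        self.encoder = encoder                    # fine-tuned end-to-end
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.GELU(),
            nn.Upsample(scale_factor=patch, mode="bilinear"),
            nn.Conv2d(256, 1, 1),                 # one channel: predicted yield
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.encoder(x)      # assumed (B, feat_dim, H/patch, W/patch)
        return self.decoder(f)   # (B, 1, H, W) continuous yield map

# Training would minimize a regression loss, e.g. nn.L1Loss(), against
# ground-truth yield maps.
```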

Result: Demonstrates superior performance over existing deep learning models on Canadian Prairies dataset, providing continuous high-resolution yield predictions.

Conclusion: Validates an effective approach bridging large-scale Earth observation with on-farm decision-making, offering scalable solution for detailed agricultural monitoring.

Abstract: Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi’s Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.

[120] CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments

Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut

Main category: cs.CV

TL;DR: CAVE is the first benchmark for real-world visual anomalies with three tasks: description, explanation, and justification, using cognitive science principles to evaluate VLMs’ anomaly perception and reasoning capabilities.

DetailsMotivation: Current anomaly detection in computer vision is limited to industrial defects or synthetic anomalies, failing to capture real-world complexity and unpredictability. Humans naturally identify and explain anomalies, but VLMs struggle with this capability.

Method: CAVE benchmark with fine-grained annotations for visual grounding and categorization based on visual manifestations, complexity, severity, and commonness. Inspired by cognitive science research on human anomaly perception.

Result: State-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies.

Conclusion: CAVE provides a realistic and cognitively grounded benchmark for advancing research in anomaly detection and commonsense reasoning in VLMs.

Abstract: Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, work on this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification, with fine-grained annotations for visual grounding and for categorizing anomalies based on their visual manifestations, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.

[121] Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance

Valentyna Starodub, Mantas Lukoševičius

Main category: cs.CV

TL;DR: This paper presents an improved AMD lesion detection framework using semantic segmentation on RGB fundus images, achieving state-of-the-art performance on the ADAM challenge benchmark.

DetailsMotivation: AMD is a leading cause of irreversible vision impairment in elderly populations, and there's a need for effective detection methods using non-invasive, cost-effective RGB fundus imaging.

Method: Used U-Net as base framework and evaluated various improvements including pre-processing techniques, different encoder backbone networks of varying complexity, and specialized loss functions to address class imbalances at image and pixel levels.
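
One common recipe for the pixel-level class imbalance the paper targets is a focal + Dice combination; the sketch below is illustrative of that family, not the paper's final configuration.

```python
import torch
import torch.nn.functional as F

def focal_dice_loss(logits, target, gamma: float = 2.0, eps: float = 1e-6):
    """Illustrative imbalance-aware loss for multi-class segmentation.
    logits: (B, C, H, W); target: (B, H, W) long tensor of class indices."""
    ce = F.cross_entropy(logits, target, reduction="none")      # (B, H, W)
    focal = ((1 - torch.exp(-ce)) ** gamma * ce).mean()         # down-weight easy pixels
    prob = logits.softmax(dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (prob * onehot).sum(dim=(0, 2, 3))
    denom = prob.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1 - ((2 * inter + eps) / (denom + eps)).mean()       # macro Dice
    return focal + dice
```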

Result: The final framework configuration outperformed all prior submissions in the ADAM challenge for multi-class segmentation of different AMD lesion types in RGB fundus images.

Conclusion: The research successfully developed an advanced AMD detection framework that sets new benchmarks for semantic segmentation of AMD lesions in non-invasive fundus imaging, with source code made publicly available.

Abstract: Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model’s architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.

[122] Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning

Bilal Hassan, Areg Karapetyan, Aaron Chung Hin Chow, Samer Madanat

Main category: cs.CV

TL;DR: A lightweight CNN model for coastal flood prediction that outperforms state-of-the-art methods by reducing MAE by nearly 20%, generalizing across Abu Dhabi and San Francisco regions.

DetailsMotivation: Climate change and sea-level rise threaten coastal cities, but traditional physics-based simulators are computationally expensive and impractical for city-scale planning, while existing DL methods face data scarcity and high-dimensional output challenges.

Method: Developed a novel lightweight CNN-based model using a vision-based, low-resource DL framework to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios.

Result: The model significantly outperforms state-of-the-art methods, reducing mean absolute error in predicted flood depth maps by nearly 20% on average, and demonstrates generalization across diverse geographical contexts.

Conclusion: The approach serves as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies against climate change impacts.

Abstract: Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives, however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: https://caspiannet.github.io/

[123] Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

Ali Rasekh, Erfan Bagheri Soula, Omid Daliran, Simon Gottschalk, Mohsen Fayyaz

Main category: cs.CV

TL;DR: The paper proposes STAVEQ2, a Video-LLM architecture with stacked temporal attention modules in the vision encoder to improve temporal understanding in videos, achieving up to +5.5% improvement on video QA benchmarks.

DetailsMotivation: Current Video-LLMs struggle with complex temporal dynamics in videos, particularly in understanding action sequences and temporal progression, which limits their ability to comprehend video content effectively.

Method: Introduces stacked temporal attention modules directly within the vision encoder, enabling the model to better capture action progression and frame relationships before passing visual tokens to the LLM.
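
A minimal sketch of one such temporal-attention module, with attention applied across frames independently per spatial token (dimensions assumed; the paper stacks several of these inside the encoder):

```python
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    """Sketch of a temporal-attention module: attention runs over the frame
    axis for each spatial token, so the encoder sees how every patch
    evolves across time before tokens reach the LLM."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) = batch, frames, spatial tokens, channels
        B, T, N, D = x.shape
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)  # sequence over time
        h = self.norm(t)
        t = t + self.attn(h, h, h, need_weights=False)[0]  # residual
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)

video_tokens = torch.randn(1, 16, 196, 768)        # 16 frames, 196 patches
out = TemporalAttentionBlock()(video_tokens)       # same shape as input
```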

Result: Significantly improves temporal reasoning and outperforms existing models in video question answering tasks, with up to +5.5% improvement on benchmarks including VITATECS, MVBench, and Video-MME.

Conclusion: Enhancing the vision encoder with temporal structure addresses a critical gap in video understanding for Video-LLMs, enabling better comprehension of temporal dynamics in videos.

Abstract: Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates a temporal attention in vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.

[124] FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation

Yuyue Zhou, Jessica Knight, Shrimanti Ghosh, Banafshe Felfeliyan, Jacob L. Jaremko, Abhilash R. Hareendranathan

Main category: cs.CV

TL;DR: FlexICL is a flexible in-context learning framework for segmenting bony regions in ultrasound images, achieving robust performance with only 5% labeled training data across wrist and elbow datasets.

DetailsMotivation: Automatic segmentation of musculoskeletal structures in ultrasound can improve diagnostic accuracy for pediatric elbow and wrist fractures, but pixel-wise expert annotations are time-consuming and costly.

Method: Proposed FlexICL framework using in-context learning for intra-video segmentation, with novel image concatenation techniques and multiple augmentation strategies to enhance performance with limited labeled data.
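
One simple concatenation layout for visual in-context learning, a 2x2 canvas where the model fills in the query's mask, can be sketched as follows. The paper systematically studies several such schemes; this one is illustrative only.

```python
import torch

def icl_canvas(support_img: torch.Tensor, support_mask: torch.Tensor,
               query_img: torch.Tensor) -> torch.Tensor:
    """Illustrative 2x2 in-context canvas: the model is trained to inpaint
    the blank bottom-right quadrant with the query's segmentation mask.
    All inputs: (C, H, W) tensors in [0, 1]."""
    blank = torch.zeros_like(query_img)
    top = torch.cat([support_img, support_mask], dim=2)   # side by side
    bottom = torch.cat([query_img, blank], dim=2)
    return torch.cat([top, bottom], dim=1)                # (C, 2H, 2W)
```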

Result: Outperformed state-of-the-art visual ICL models (Painter, MAE-VQGAN) and conventional segmentation models (U-Net, TransUNet) by 1-27% Dice coefficient on 1,252 US sweeps.

Conclusion: FlexICL shows potential as an efficient and scalable solution for ultrasound image segmentation in medical imaging where labeled data is scarce.

Abstract: Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models such as Painter and MAE-VQGAN, as well as conventional segmentation models such as U-Net and TransUNet, by 1-27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation, well suited for medical imaging use cases where labeled data is scarce.

[125] Neighborhood Feature Pooling for Remote Sensing Image Classification

Fahimeh Orvati Nia, Amirmohammad Mohammadi, Salim Al Kharsa, Pragati Naikare, Zigfried Hampel-Arias, Joshua Peeples

Main category: cs.CV

TL;DR: Proposes neighborhood feature pooling (NFP) for texture feature extraction in remote sensing classification, improving performance across datasets with minimal parameter overhead.

DetailsMotivation: To develop an efficient texture feature extraction method for remote sensing image classification that captures local relationships and similarities.

Method: Uses neighborhood feature pooling (NFP) implemented with convolutional layers to aggregate local similarities across feature dimensions, seamlessly integrable into any network.
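
Under one plausible reading of the abstract, NFP can be sketched with an unfold over the k x k neighborhood followed by per-position similarities; the details below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def neighborhood_feature_pooling(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Sketch of neighborhood feature pooling: for every spatial position,
    compute the similarity between its feature vector and each of its
    k*k neighbors, yielding a k*k-channel texture map. k must be odd."""
    B, C, H, W = x.shape
    neigh = F.unfold(x, k, padding=k // 2)            # (B, C*k*k, H*W)
    neigh = neigh.view(B, C, k * k, H, W)
    center = x.unsqueeze(2)                           # (B, C, 1, H, W)
    return F.cosine_similarity(center, neigh, dim=1)  # (B, k*k, H, W)

texture = neighborhood_feature_pooling(torch.randn(2, 64, 32, 32))
print(texture.shape)  # torch.Size([2, 9, 32, 32])
```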

Result: NFP consistently improves performance across diverse datasets and architectures compared to baseline models.

Conclusion: NFP is an effective texture feature extraction method that enhances classification performance while maintaining minimal parameter overhead.

Abstract: In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.

[126] Dynamic VLM-Guided Negative Prompting for Diffusion Models

Hoyeon Chang, Seungjin Kim, Yoonseok Choi

Main category: cs.CV

TL;DR: Dynamic negative prompting using VLMs to generate adaptive negative prompts during diffusion model denoising, improving over fixed negative prompts.

DetailsMotivation: Traditional negative prompting uses fixed prompts, which may not be contextually appropriate throughout the denoising process. This paper aims to create adaptive negative prompts that better guide the generation.

Method: Generate intermediate image predictions at specific denoising steps, then query a Vision-Language Model to produce contextually appropriate negative prompts based on the current image state.
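
The sampling loop can be sketched as follows; `pipe`, `vlm`, and all of their methods are hypothetical placeholders for illustration, not a real diffusers or VLM API.

```python
import torch

def dynamic_negative_sampling(pipe, vlm, prompt, steps=50,
                              query_at=(10, 25, 40)):
    """Hypothetical sketch: `pipe.decode_x0_estimate`, `pipe.denoise_step`,
    `pipe.decode`, and `vlm.ask` are invented placeholder interfaces."""
    latents = torch.randn(1, 4, 64, 64)
    negative = ""                           # fixed methods set this once
    for t in range(steps):
        if t in query_at:
            # Decode an intermediate x0 estimate and ask the VLM what
            # should be suppressed at this stage of generation.
            preview = pipe.decode_x0_estimate(latents, t)
            negative = vlm.ask(preview, "List artifacts or unwanted content "
                                        "to suppress, phrased as a prompt.")
        latents = pipe.denoise_step(latents, t, prompt, negative)
    return pipe.decode(latents)
```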

Result: Evaluated on benchmark datasets showing trade-offs between negative guidance strength and text-image alignment. The method demonstrates improved performance over fixed negative prompting approaches.

Conclusion: Dynamic negative prompting using VLMs provides a more effective approach than fixed negative prompts, with adaptive prompts that better guide the diffusion process while maintaining text-image alignment.

Abstract: We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional Negative Prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.

[127] Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr

Main category: cs.CV

TL;DR: First benchmark and metric suite for academic poster generation, with PosterAgent pipeline outperforming GPT-4o using 87% fewer tokens at $0.005 per poster.

DetailsMotivation: Academic poster generation is challenging because it requires compressing long, interleaved documents into a single visually coherent page, motivating automated solutions.

Method: PosterAgent pipeline: Parser distills paper into asset library, Planner creates binary-tree layout, Painter-Commenter loop refines panels using VLM feedback.

Result: Open-source variants outperform GPT-4o across metrics, achieving better visual quality, textual coherence, and PaperQuiz scores with 87% fewer tokens.

Conclusion: The work charts directions for next-generation automated poster-generation models; reader engagement emerges as the primary aesthetic bottleneck, since human-designed posters rely largely on visual semantics to convey meaning.

Abstract: Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters; (ii) Textual Coherence: language fluency; (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge; and notably (iv) PaperQuiz: the poster’s ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.

[128] Security Risk of Misalignment between Text and Image in Multi-modal Model

Xiaosen Wang, Zhijin Ge, Shaokang Wang

Main category: cs.CV

TL;DR: PReMA is a novel adversarial attack that manipulates multi-modal diffusion model outputs by modifying input images while keeping prompts fixed, enabling generation of inappropriate content without changing text inputs.

DetailsMotivation: Existing multi-modal diffusion models have inadequate alignment between text and image modalities, creating security risks for generating NSFW content. Current attacks focus on adversarial prompts, leaving image-based manipulation underexplored.

Method: Proposed the Prompt-Restricted Multi-modal Attack (PReMA), which creates adversarial images that, when combined with any specified prompt, manipulate model outputs without altering the prompt itself. It targets image-editing applications that operate with fixed prompts.
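
In spirit, image-only attacks of this kind follow a PGD-style loop. The sketch below is a generic instance with a stand-in differentiable `model` callable and a chosen target output; it is not PReMA's actual objective.

```python
import torch

def pgd_image_attack(model, image, prompt, target, steps=100,
                     eps=8 / 255, alpha=1 / 255):
    """Illustrative PGD-style attack: perturb only the input image, keep the
    prompt fixed, and push the model's output toward `target`.
    `model(image, prompt)` is a placeholder for an editing pipeline."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        out = model(image + delta, prompt)
        loss = torch.nn.functional.mse_loss(out, target)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # step toward the target
            delta.clamp_(-eps, eps)              # keep perturbation small
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```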

Result: Comprehensive evaluations on image inpainting and style transfer tasks across various models demonstrate PReMA’s potent efficacy in manipulating generated content.

Conclusion: PReMA represents a novel threat to multi-modal diffusion model integrity, particularly for applications using fixed prompts, highlighting the need for improved modality alignment and security measures.

Abstract: Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.

[129] EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao

Main category: cs.CV

TL;DR: EgoExo-Con benchmark evaluates Video-LLMs’ temporal consistency across egocentric and exocentric viewpoints, revealing models struggle with cross-view consistency. Proposed View-GRPO framework improves consistency through reinforcement learning.

DetailsMotivation: To study whether Video-LLMs can achieve consistent temporal understanding when videos capture the same event from different viewpoints (egocentric vs exocentric).

Method: Introduced EgoExo-Con benchmark with synchronized video pairs and human-refined queries. Proposed View-GRPO reinforcement learning framework to strengthen view-specific reasoning while encouraging cross-view consistency.

Result: Existing Video-LLMs fail to maintain consistency across viewpoints, performing worse than single-view. Naive fine-tuning with synchronized videos improves consistency but underperforms single-view training. View-GRPO outperforms SFT and GRPO in improving cross-view consistency.

Conclusion: View-GRPO effectively addresses the cross-view consistency problem in Video-LLMs, demonstrating superiority over existing methods for temporal understanding across different viewpoints.

Abstract: Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.

[130] OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research

Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, Xu Peng, Taisong Jin, Yongge Liu, Shengwei Han, Jing Yang, Xiaoping He, Feng Gao, AndyPian Wu, SevenShu, Chaoyang Wang, Chengjie Wang

Main category: cs.CV

TL;DR: OracleAgent is the first agent system for Oracle Bone Script (OBS) research that integrates multiple analysis tools with LLMs and uses a comprehensive multimodal knowledge base to address challenges in OBS interpretation and information retrieval.

DetailsMotivation: Current OBS research faces challenges in complex interpretation workflows and inefficient information organization/retrieval, where scholars spend substantial effort searching for and managing resources.

Method: Developed OracleAgent system that integrates multiple OBS analysis tools powered by LLMs, with flexible orchestration capabilities. Built a comprehensive multimodal knowledge base containing over 1.4M single-character rubbing images and 80K interpretation texts through multi-year data collection and expert annotation.

Result: OracleAgent achieves superior performance in multimodal reasoning and generation tasks, surpassing leading MLLMs like GPT-4o. Case studies show it significantly reduces time cost for domain experts in OBS research.

Conclusion: OracleAgent represents a significant step toward practical deployment of OBS-assisted research and automated interpretation systems, effectively addressing the major challenges in current OBS research.

Abstract: As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieval tasks of character, document, interpretation text, and rubbing image. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.

[131] JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting

Yuxuan Li, Tao Wang, Xianben Yang

Main category: cs.CV

TL;DR: A unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated camera poses, outperforming traditional COLMAP-based methods.

DetailsMotivation: Traditional novel view synthesis methods rely on external camera pose estimation tools like COLMAP, which introduce computational bottlenecks and propagate errors.

Method: A co-optimization strategy that iteratively refines 3D Gaussian parameters and camera poses through two interleaved phases: updating 3D Gaussian parameters via differentiable rendering with fixed poses, and refining camera poses using a customized 3D optical flow algorithm with geometric and photometric constraints.
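
The two interleaved phases can be laid out schematically; every callable below is an injected placeholder rather than a real API, so this is a structural sketch, not runnable against any specific renderer.

```python
def joint_optimize(gaussians, poses, dataset, render, optical_flow_3d,
                   update_pose_from_flow, photometric_loss, opt,
                   num_iters=30):
    """Schematic alternating loop in the spirit of the method: Phase 1 fixes
    poses and updates Gaussians via differentiable rendering; Phase 2 fixes
    Gaussians and refines poses from a 3D optical-flow alignment."""
    for _ in range(num_iters):
        # Phase 1: update 3D Gaussian parameters with poses held fixed
        for view, gt in dataset:
            loss = photometric_loss(render(gaussians, poses[view]), gt)
            loss.backward()
            opt.step(); opt.zero_grad()
        # Phase 2: refine each camera pose with geometric + photometric cues
        for view, gt in dataset:
            flow = optical_flow_3d(render(gaussians, poses[view]), gt)
            poses[view] = update_pose_from_flow(poses[view], flow)
    return gaussians, poses
```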

Result: Extensive evaluations show the approach significantly outperforms existing COLMAP-free techniques in reconstruction quality and surpasses standard COLMAP-based baselines in general, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions.

Conclusion: The proposed unified framework effectively eliminates the need for pre-calibrated camera poses while achieving superior reconstruction quality and pose accuracy compared to traditional methods.

Abstract: Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.

[132] WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, Ben Sapp, Mingxing Tan, Jyh-Jing Hwang, Drago Anguelov

Main category: cs.CV

TL;DR: WOD-E2E is a new dataset for end-to-end driving that focuses on challenging long-tail scenarios occurring less than 0.03% of the time, with 4,021 segments and 12 hours of data, featuring routing info, ego states, and 360-degree camera views.

DetailsMotivation: Current E2E driving benchmarks mainly test nominal scenarios and lack adequate evaluation of long-tail situations. Existing metrics fail to capture driving's multi-modal nature and performance in rare scenarios.

Method: Created WOD-E2E dataset with curated long-tail scenarios, and proposed Rater Feedback Score (RFS) metric that measures how well predicted trajectories match human rater preference labels instead of just waypoint distances.

Result: Released dataset with 4,021 driving segments and rater preference labels for validation set. Test set labels used for the 2025 WOD-E2E Challenge to benchmark performance on rare scenarios.

Conclusion: WOD-E2E aims to advance research in generalizable, robust, and safe end-to-end autonomous driving by providing challenging long-tail scenarios and better evaluation metrics for complex real-world situations.

Abstract: Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, occurring with a frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.

[133] Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM

Ali Caglayan, Nevrez Imamoglu, Oguzhan Guclu, Ali Osman Serhatoglu, Ahmet Burak Can, Ryosuke Nakamura

Main category: cs.CV

TL;DR: The paper proposes integrating gradient-based attention information with CNN features to enhance RGB-D indoor SLAM performance, showing improved results especially in large environments.

DetailsMotivation: Current CNN representations for semantic object understanding lack explicit integration of gradient-based attention information, which could benefit visual tasks like SLAM by enriching representations with spatially attentive object locations.

Method: Integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance in RGB-D indoor SLAM.
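
A Grad-CAM-flavored sketch of gating CNN features with gradient-derived attention; the paper's exact layer-wise integration is an assumption here.

```python
import torch
import torch.nn.functional as F

def attention_weighted_features(feats: torch.Tensor, score: torch.Tensor):
    """Sketch: weight CNN features by gradient-derived attention before
    using them as a frame descriptor for association.
    feats: (B, C, H, W) activations kept in the autograd graph;
    score: scalar task output (e.g., top class logit) to differentiate."""
    grads = torch.autograd.grad(score, feats, retain_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)       # GAP over space
    cam = F.relu((weights * feats).sum(dim=1, keepdim=True))
    cam = cam / cam.amax(dim=(2, 3), keepdim=True).clamp_min(1e-8)
    return feats * cam                                   # attention-gated features
```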

Result: Experimental results show improved performance compared to baseline methods, with particular enhancement for large environments.

Conclusion: Integrating task-specific network attention with CNN features effectively improves SLAM performance, demonstrating the value of explicit gradient-based attention integration for visual tasks.

Abstract: Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.

[134] FullPart: Generating each 3D Part at Full Resolution

Lihe Ding, Shaocong Dong, Yaokun Li, Chenjian Gao, Xiao Chen, Rui Han, Yihao Kuang, Hong Zhang, Bo Huang, Zhanpeng Huang, Zibin Wang, Dan Xu, Tianfan Xue

Main category: cs.CV

TL;DR: FullPart combines implicit and explicit paradigms for 3D part generation, using implicit diffusion for bounding box layout and full-resolution voxel grids for detailed part generation, achieving state-of-the-art results.

DetailsMotivation: Previous methods either use implicit representations with insufficient geometric details or explicit voxel representations where small parts occupy too few voxels, leading to degraded quality.

Method: First uses implicit box vector-set diffusion for bounding box layout, then generates detailed parts in individual full-resolution voxel grids with center-point encoding to maintain global coherence.

Result: Achieves state-of-the-art results in 3D part generation, enabling synthesis of intricate details even for small parts by generating each part at full resolution.

Conclusion: FullPart effectively combines implicit and explicit paradigms to overcome limitations of previous methods, and introduces PartVerse-XL dataset to advance 3D part generation research.

Abstract: Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method, even small ones, is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and models to benefit future research in 3D part generation.

[135] BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation

Wei Shang, Wanying Zhang, Shuhang Gu, Pengfei Zhu, Qinghua Hu, Dongwei Ren

Main category: cs.CV

TL;DR: BasicAVSR is a strong baseline for arbitrary-scale video super-resolution that integrates adaptive multi-scale frequency priors, flow-guided propagation, second-order motion compensation, and hyper-upsampling to achieve high-quality, temporally consistent video enhancement at various scaling factors.

Motivation: Arbitrary-scale video super-resolution faces challenges in spatial detail reproduction, temporal consistency, and computational complexity. Existing methods struggle to handle diverse scaling factors while maintaining quality and efficiency across different application scenarios.

Method: Proposes BasicAVSR with four key components: 1) adaptive multi-scale frequency priors from Laplacian pyramids, 2) flow-guided propagation unit for spatiotemporal aggregation, 3) second-order motion compensation for accurate frame alignment, and 4) hyper-upsampling unit for scale-aware upsampling. Includes three propagation variants for different scenarios: unidirectional RNN (online), limited lookahead RNN (small delay), and bidirectional RNN (offline).

Result: Experimental results show BasicAVSR significantly outperforms existing methods in super-resolution quality, generalization ability, and inference speed. The model demonstrates effectiveness and adaptability across different scenarios and propagation variants.

Conclusion: BasicAVSR advances state-of-the-art in arbitrary-scale video super-resolution and extends its core components to multiple frameworks for diverse application scenarios, providing a strong baseline with superior performance across quality, generalization, and speed metrics.

Abstract: Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
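
The adaptive multi-scale frequency priors are built from image Laplacian pyramids. A minimal decomposition sketch follows; the level count and pooling choice are assumptions, not the paper's exact operators.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, levels=3):
    """Band-pass decomposition of frames (B, C, H, W); assumes even
    spatial dims at every level. Returns high-frequency bands plus
    the final low-frequency residual."""
    bands, cur = [], x
    for _ in range(levels):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:],
                           mode="bilinear", align_corners=False)
        bands.append(cur - up)   # detail lost by downsampling
        cur = down
    bands.append(cur)            # low-frequency residual
    return bands
```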

[136] MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction

Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba

Main category: cs.CV

TL;DR: A novel Multi-View Mammography and Language Model (MV-MLM) that uses vision-language pre-training with synthetic radiology reports to achieve state-of-the-art performance in breast cancer classification and risk prediction tasks.

Motivation: Large annotated datasets for breast cancer CAD models are costly and time-consuming to acquire. Vision-Language Models offer a promising solution for enhanced robustness and data efficiency in medical imaging.

Method: Leverages multi-view supervision with cross-modal self-supervision across image-text pairs, using multiple mammogram views and corresponding pseudo-radiology reports. Employs a joint visual-textual learning strategy to enhance generalization across different data types and tasks.

Result: Achieves state-of-the-art performance in three classification tasks: malignancy classification, subtype classification, and image-based cancer risk prediction. Demonstrates strong data efficiency, outperforming fully supervised and VLM baselines while trained on synthetic text reports without actual radiology reports.

Conclusion: The proposed MV-MLM model effectively addresses data annotation challenges in medical imaging by leveraging vision-language pre-training with synthetic reports, achieving superior performance in breast cancer analysis tasks with enhanced data efficiency.

Abstract: Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-grained annotations is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy across different data types and tasks, distinguishing breast tissue or cancer characteristics (calcification, mass) and utilizing these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
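
The cross-modal self-supervision over image-text pairs is CLIP-style. As a sketch of the family of objectives involved, here is a generic symmetric InfoNCE loss over paired (mammogram view, pseudo-report) embeddings; the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th mammogram view matches the i-th
    pseudo-report within a batch of paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```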

[137] Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh

Sudipto Das Sukanto, Diponker Roy, Fahim Shakil, Nirjhar Singha, Abdullah Asik, Aniket Joarder, Mridha Md Nafis Fuad, Muhammad Ibrahim

Main category: cs.CV

TL;DR: This paper presents a YOLOv8-based machine learning system for real-time auto-rickshaw detection in traffic images, achieving 83.447% mAP50 and over 78% precision/recall.

Motivation: Auto-rickshaw monitoring is necessary due to traffic restrictions, but existing surveillance systems struggle to distinguish them from similar vehicles like non-auto rickshaws, and manual video analysis is too time-consuming.

Method: Used YOLOv8 model for real-time object detection, trained on a custom dataset of 1,730 annotated images captured under various traffic conditions.

Result: The model achieved mAP50 of 83.447% and binary precision and recall values above 78%, performing well in both dense and sparse traffic scenarios.

Conclusion: The proposed machine learning approach effectively automates auto-rickshaw detection in real-time, and the dataset has been publicly released for further research.

Abstract: Modes of transportation vary across countries depending on geographical location and cultural context. In South Asian countries, rickshaws are among the most common means of local transport. Based on their mode of operation, rickshaws in cities across Bangladesh can be broadly classified into non-auto (pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from accessing certain routes. However, existing surveillance systems make it quite difficult to monitor them due to their similarity to other vehicles, especially non-auto rickshaws, whereas manual video analysis is too time-consuming. This paper presents a machine learning-based approach to automatically detect auto-rickshaws in traffic images. In this system, we use the YOLOv8 model for real-time object detection. For training purposes, we prepared a set of 1,730 annotated images that were captured under various traffic conditions. The results show that our proposed model performs well in real-time auto-rickshaw detection and offers an mAP50 of 83.447% and binary precision and recall values above 78%, demonstrating its effectiveness in handling both dense and sparse traffic scenarios. The dataset has been publicly released for further research.
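
Since the system builds on standard YOLOv8 tooling, a training/inference sketch with the public Ultralytics API might look as follows; `rickshaw.yaml` and the file names are hypothetical placeholders, not artifacts from the paper.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                        # pretrained checkpoint
model.train(data="rickshaw.yaml", epochs=100, imgsz=640)
metrics = model.val()                              # mAP50, precision, recall
results = model.predict("traffic_scene.jpg", conf=0.25)
```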

[138] CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou, Gunveer Gujral, Yinglong Xia, Shane Moon, Nicolas Scheffer, Nirav Shah, Eun Chang, Yue Liu, Florian Metze, Tammy Stark, Zhaleh Feizollahi, Andrea Jessee, Mangesh Pujari, Ahmed Aly, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Wen-tau Yih, Xin Luna Dong

Main category: cs.CV

TL;DR: CRAG-MM is a comprehensive benchmark for multi-modal RAG with 6.5K image-question-answer triplets and 2K multi-turn conversations across 13 domains, designed for wearable device scenarios.

Motivation: There is no comprehensive benchmark for multi-modal RAG in wearable scenarios, despite its importance for smart glasses and similar devices.

Method: Created diverse dataset with 6.2K egocentric images, multiple question types, image-quality issues, entity popularity variations, and different conversation turns. Designed three tasks with retrieval corpus and APIs.

Result: Baseline RAG approaches achieve only 32-43% truthfulness, similar to state-of-the-art industry solutions. Winning KDD Cup solutions improved baseline by 28%.

Conclusion: CRAG-MM fills an important gap and shows significant room for improvement in multi-modal RAG systems, with early impact demonstrated through competitions.

Abstract: Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearable scenarios. To fill this gap, we present CRAG-MM – a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations – each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.

[139] MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models

Wontae Choi, Jaelin Lee, Hyung Sup Yun, Byeungwoo Jeon, Il Yong Chun

Main category: cs.CV

TL;DR: MoTDiff is a high-resolution motion trajectory estimation framework using diffusion models that extracts fine-grained motion information from single motion-blurred images, outperforming state-of-the-art methods in blind image deblurring and coded exposure photography.

Motivation: Existing motion representations from single blurred images are often of low quality, i.e., coarse-grained and inaccurate. Accurate motion estimation is crucial for computational imaging and computer vision applications.

Method: Proposes MoTDiff framework with two key components: 1) conditional diffusion framework using multi-scale feature maps from blurred images as condition, 2) training method promoting precise identification of fine-grained motion trajectory, consistent shape/position estimation, and pixel connectivity.

Result: MoTDiff outperforms state-of-the-art methods in both blind image deblurring and coded exposure photography applications.

Conclusion: The proposed MoTDiff framework successfully estimates high-resolution motion trajectories from single motion-blurred images using diffusion models, achieving superior performance compared to existing methods.

Abstract: Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Different from existing motion representations, we aim to estimate an HR motion trajectory with high-quality from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that can promote precise identification of a fine-grained motion trajectory, consistent estimation of overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff can outperform state-of-the-art methods in both blind image deblurring and coded exposure photography applications.

[140] ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

Jinho Choi, Hyesu Lim, Steffen Schneider, Jaegul Choo

Main category: cs.CV

TL;DR: ConceptScope is an automated framework that uses Sparse Autoencoders on vision foundation models to discover and quantify human-interpretable concepts in datasets, enabling systematic bias identification and dataset analysis without requiring fine-grained annotations.

Motivation: Dataset bias is common in machine learning but challenging to identify without costly manual annotations. There's a need for scalable, automated methods to systematically discover and quantify biases in visual datasets.

Method: Uses Sparse Autoencoders trained on representations from vision foundation models to discover interpretable concepts. Categorizes concepts into target, context, and bias types based on semantic relevance and statistical correlation to class labels. Enables concept-based subgrouping for bias identification and robustness evaluation.

Result: Validated to capture diverse visual concepts (objects, textures, backgrounds, facial attributes, emotions, actions). Concept activations produce spatial attributions aligned with meaningful image regions. Successfully detected known biases (background bias in Waterbirds) and uncovered unannotated biases (co-occurring objects in ImageNet).

Conclusion: ConceptScope provides a practical tool for automated dataset auditing and model diagnostics, offering scalable bias detection without requiring expensive manual annotations.

Abstract: Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g., co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
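
The concept-discovery backbone is a sparse autoencoder over frozen foundation-model features. Below is a minimal PyTorch sketch; the latent width, the L1 weight, and the class itself are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE over frozen vision-foundation-model features; each
    latent unit is a candidate human-interpretable concept."""
    def __init__(self, d_model=768, d_latent=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = F.relu(self.enc(x))      # sparse concept activations
        return self.dec(z), z

def sae_loss(x, x_hat, z, l1=1e-3):
    # Reconstruction plus an L1 penalty that enforces sparsity.
    return F.mse_loss(x_hat, x) + l1 * z.abs().mean()
```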

[141] Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction

Li Wang, Yiyu Zhuang, Yanwen Wang, Xun Cao, Chuan Guo, Xinxin Zuo, Hao Zhu

Main category: cs.CV

TL;DR: A novel approach for 3D human pose estimation from sketches using a “learn from synthesis” strategy with diffusion models to create synthetic sketch-pose datasets, enabling efficient and accurate pose estimation across diverse sketch styles.

Motivation: Traditional sketch-to-pose methods are limited by lack of large-scale sketch-3D pose annotations and rely on time-consuming optimization with heuristic rules, which have poor generalizability.

Method: Uses diffusion model to synthesize sketch images from 2D poses projected from 3D human poses, creating SKEP-120K synthetic dataset. Combines 2D pose detectors, generative diffusion priors, and feed-forward neural network with heuristic loss functions for geometric coherence and self-contact preservation.

Result: Qualitative, quantitative, and subjective evaluations show the model substantially surpasses previous methods in both estimation accuracy and speed for sketch-to-pose tasks.

Conclusion: The proposed data-driven framework with synthetic dataset generation enables efficient and accurate 3D human pose estimation from diverse sketch styles, overcoming limitations of previous optimization-based approaches.

Abstract: 3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules, an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a “learn from synthesis” strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts. Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.

[142] Developing a Multi-task Ensemble Geometric Deep Network for Supply Chain Sustainability and Risk Management

Mehdi Khaleghi, Nastaran Khaleghi, Sobhan Sheykhivand, Sebelan Danishvar

Main category: cs.CV

TL;DR: A novel Chebyshev ensemble geometric network (Ch-EGN) is proposed for supply chain sustainability, achieving high accuracy in risk management (98.95%), product classification (100%), and relationship classification (up to 98.07%) across two datasets.

Motivation: Supply chain sustainability requires effective risk management and product classification. Recent deep learning advancements provide opportunities to analyze complex supply chain dependencies and improve performance.

Method: Proposed Chebyshev ensemble geometric network (Ch-EGN) - a hybrid convolutional and geometric deep learning approach that leverages information dependencies in supply chains to derive invisible states from database samples.

Result: Achieved 98.95% accuracy for risk management (delivery status prediction), 100% accuracy for 5 product group classification, 98.07% for 4 product relation classification, and 92.37% for 25 company relation classification. Outperformed state-of-the-art methods.

Conclusion: The Ch-EGN ensemble network effectively enhances supply chain sustainability through superior performance in risk management and classification tasks, demonstrating significant improvements over existing approaches.

Abstract: The sustainability of the supply chain plays a key role in achieving optimal performance in controlling the supply chain. The management of risks that occur in a supply chain is fundamental to developing the sustainability of the network and elevating the performance efficiency of the supply chain. The correct classification of products is another essential element in a sustainable supply chain. Acknowledging recent breakthroughs in the context of deep networks, several architectural options have been deployed to analyze supply chain datasets. We propose an ensemble deep network built on a novel geometric deep network. The proposed Chebyshev ensemble geometric network (Ch-EGN) is a hybrid convolutional and geometric deep learning approach. This network leverages the information dependencies in the supply chain to derive invisible states of samples in the database. The functionality of the proposed deep network is assessed on two different databases. The SupplyGraph Dataset and DataCo are considered in this research. Delivery status prediction on the DataCo supply chain is performed for risk management. Product classification and edge classification are performed using the SupplyGraph database to enhance the sustainability of the supply network. An average accuracy of 98.95% is obtained for the ensemble network for risk management. Average accuracies of 100% and 98.07% are obtained for the sustainable supply chain in terms of 5 product group classification and 4 product relation classification, respectively. An average accuracy of 92.37% is attained for 25 company relation classification. The results confirm the improved accuracy and efficiency of the proposed method compared to state-of-the-art approaches.
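
The Chebyshev graph-convolution component can be sketched with an off-the-shelf operator; the hybrid convolutional parts and the ensemble logic of Ch-EGN are omitted, and the class below is an assumption built on `torch_geometric`.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import ChebConv

class ChebGNN(torch.nn.Module):
    """Two Chebyshev spectral graph convolutions over a supply-chain
    graph; K is the order of the Chebyshev polynomial filter."""
    def __init__(self, in_dim, hidden, n_classes, K=3):
        super().__init__()
        self.conv1 = ChebConv(in_dim, hidden, K)
        self.conv2 = ChebConv(hidden, n_classes, K)

    def forward(self, x, edge_index):
        # x: (num_nodes, in_dim), edge_index: (2, num_edges)
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)
```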

[143] OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation

Hengrui Kang, Zhuangcheng Gu, Zhiyuan Zhao, Zichen Wen, Bin Wang, Weijia Li, Conghui He

Main category: cs.CV

TL;DR: The paper introduces OmniLayout-1M, a million-scale diverse document layout dataset, and OmniLayout-LLM, a 0.5B model with a coarse-to-fine learning paradigm for document layout generation.

Motivation: Document layout generation remains underexplored compared to document layout analysis, with existing datasets dominated by academic papers and lacking diversity in open-world document types like newspapers and magazines.

Method: Proposes OmniLayout-LLM, a 0.5B model with a two-stage coarse-to-fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse categories, and 2) transferring knowledge to specific domains with fine-grained annotations.

Result: The approach achieves strong performance on multiple domains in the M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs.

Conclusion: The work addresses the scarcity of diverse document layouts and introduces an effective method for document layout generation that outperforms existing approaches across multiple domains.

Abstract: Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with a purpose-designed two-stage coarse-to-fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in the M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. Our code, models, and dataset will be publicly released.

[144] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models

Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa

Main category: cs.CV

TL;DR: VLMs struggle with temporal reasoning, performing poorly on judging video direction (forward vs backward) compared to humans, especially for irreversible physical processes and causal actions.

Motivation: Current vision-language models have weak temporal understanding in videos, which is under-evaluated. The paper aims to test VLMs' ability to infer temporal direction using psychophysically validated stimuli.

Method: Created AoT-PsyPhyBENCH benchmark with natural videos and human behavioral baselines. Evaluated various VLMs (open-weight, proprietary, reasoning and non-reasoning) on arrow of time judgment task.

Result: Most models performed near chance level. Even the best models lagged far behind human accuracy, especially on physically irreversible processes (free fall, diffusion/explosion) and causal manual actions.

Conclusion: There is a fundamental gap in multimodal systems - they capture visual-semantic correlations but lack inductive biases for temporal continuity and causal understanding. The benchmark is released to encourage progress in temporal reasoning.

Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), i.e., whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.

[145] Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws

Lin Guo, Xiaoqing Luo, Wei Xie, Zhancheng Zhang, Hui Li, Rui Wang, Zhenhua Feng, Xiaoning Song

Main category: cs.CV

TL;DR: HCLFuse is a novel infrared and visible image fusion method inspired by human cognitive laws, using multi-scale mask-regulated variational bottleneck encoding and diffusion models with time-varying physical guidance to achieve state-of-the-art fusion performance.

Motivation: Existing fusion methods struggle with balancing modal information and lack interpretability in information selection, affecting reliability in complex scenarios. Generative methods have limited capabilities and poor interpretability.

Method: Uses multi-scale mask-regulated variational bottleneck encoder with posterior probability modeling and information decomposition for low-level modal information extraction. Integrates diffusion model’s probabilistic generation with physical laws through time-varying physical guidance mechanism.

Result: Achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets, significantly improves semantic segmentation metrics, enhances structural consistency and detail quality.

Conclusion: The human cognition-inspired generative fusion method effectively enhances structural consistency and detail quality, demonstrating advantages in infrared and visible image fusion.

Abstract: Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This fully demonstrates the advantages of this generative image fusion method, drawing inspiration from human cognition, in enhancing structural consistency and detail quality.

[146] Exploring Complementarity and Explainability in CNNs for Periocular Verification Across Acquisition Distances

Fernando Alonso-Fernandez, Kevin Hernandez Diaz, Jose M. Buades, Kiran Raja, Josef Bigun

Main category: cs.CV

TL;DR: Study of CNN complementarity for periocular verification across distances on UBIPr database, showing fusion of three networks (SqueezeNet, MobileNetv2, ResNet50) achieves state-of-the-art performance.

Motivation: To investigate how different CNN architectures with varying complexity levels complement each other for periocular biometric verification, especially at different distances.

Method: Train three CNN architectures (SqueezeNet, MobileNetv2, ResNet50) on VGGFace2 eye crops, analyze performance with cosine and chi2 metrics, compare network initializations, apply score-level fusion via logistic regression, and use LIME heatmaps and Jensen-Shannon divergence to compare attention patterns.

Result: ResNet50 performs best individually, but fusion provides substantial gains, especially when combining all three networks. Heatmaps reveal networks focus on distinct image regions, explaining their complementarity. Method significantly outperforms previous works on UBIPr.

Conclusion: The complementarity of different CNN architectures enables significant performance improvements through fusion, achieving new state-of-the-art results for periocular verification on UBIPr database.

Abstract: We study the complementarity of different CNNs for periocular verification at different distances on the UBIPr database. We train three architectures of increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of eye crops from VGGFace2. We analyse performance with cosine and chi2 metrics, compare different network initialisations, and apply score-level fusion via logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon divergence to compare attention patterns of the CNNs. While ResNet50 consistently performs best individually, the fusion provides substantial gains, especially when combining all three networks. Heatmaps show that networks usually focus on distinct regions of a given image, which explains their complementarity. Our method significantly outperforms previous works on UBIPr, achieving a new state-of-the-art.
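
Score-level fusion via logistic regression is a standard recipe; here is a minimal sketch with scikit-learn, where the random scores are placeholders for the three networks' verification scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder scores: one column per network (SqueezeNet, MobileNetv2,
# ResNet50); labels mark genuine (1) vs. impostor (0) pairs.
scores = np.random.rand(1000, 3)
labels = np.random.randint(0, 2, 1000)

fusion = LogisticRegression().fit(scores, labels)
fused = fusion.predict_proba(scores)[:, 1]   # fused match score in [0, 1]
```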

[147] Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving

Lin Liu, Guanyi Yu, Ziying Song, Junqiao Li, Caiyan Jia, Feiyang Jia, Peiliang Wu, Yandan Luo

Main category: cs.CV

TL;DR: CATG is a novel autonomous driving planning framework that uses Constrained Flow Matching to generate diverse trajectories while incorporating safety and kinematic constraints directly into the generative process, avoiding mode collapse and eliminating the need for additional optimization stages.

Motivation: Prevailing imitation learning methods suffer from mode collapse and fail to produce diverse trajectory hypotheses, while existing generative approaches struggle to incorporate safety and physical constraints directly, requiring additional optimization stages to refine outputs.

Method: Leverages Constrained Flow Matching to explicitly model the flow matching process, allowing flexible guidance from conditioning signals. Imposes explicit constraints directly within the flow matching process to ensure trajectories adhere to safety and kinematic rules. Parameterizes driving aggressiveness as a control signal during generation.

Result: Achieved 2nd place on the NavSim v2 challenge with an EPDMS score of 51.31 and received the Innovation Award.

Conclusion: CATG successfully addresses mode collapse in imitation learning and constraint incorporation issues in generative approaches by using Constrained Flow Matching, enabling diverse trajectory generation with built-in safety constraints and style control.

Abstract: Planning is a critical component of end-to-end autonomous driving. However, prevailing imitation learning methods often suffer from mode collapse, failing to produce diverse trajectory hypotheses. Meanwhile, existing generative approaches struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. To address these limitations, we propose CATG, a novel planning framework that leverages Constrained Flow Matching. Concretely, CATG explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our primary contribution is the novel imposition of explicit constraints directly within the flow matching process, ensuring that the generated trajectories adhere to vital safety and kinematic rules. Secondly, CATG parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Notably, on the NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and was honored with the Innovation Award.
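
A bare-bones conditional flow-matching objective for trajectories, with an illustrative hard constraint, gives a feel for the mechanics. This is a sketch under assumed shapes, not CATG's actual constraint formulation.

```python
import torch

def constrained_fm_loss(model, traj, cond, v_max=15.0):
    """Conditional flow matching over trajectories (B, T, 2). The speed
    clamp is an illustrative stand-in for CATG's explicit constraints."""
    x1 = traj
    x0 = torch.randn_like(x1)                         # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                        # linear probability path
    v_target = x1 - x0                                # target velocity field
    v_pred = model(xt, t, cond).clamp(-v_max, v_max)  # constrained prediction
    return ((v_pred - v_target) ** 2).mean()
```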

[148] Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping

Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Jose Maria Buades Rubio, Josef Bigun

Main category: cs.CV

TL;DR: This paper evaluates three CNN architectures for periocular biometric recognition using large-scale training data from VGGFace2, achieving state-of-the-art performance on the UFPR-Periocular dataset with 1-2% EER.

Motivation: To address the limitations of existing periocular recognition methods that rely on small-scale datasets, and to leverage the periocular region's high discrimination capability with minimal acquisition constraints.

Method: Used three CNN architectures of varying depth/complexity trained on 1,907,572 ocular crops from VGGFace2 database, evaluated on VGGFace2-Pose and UFPR-Periocular datasets.

Result: Achieved 9-15% EER on uncontrolled VGGFace2-Pose, but significantly better 1-2% EER on UFPR-Periocular - the lowest reported EERs on this dataset to date.

Conclusion: Large-scale training data significantly improves periocular recognition performance, with best results achieved under controlled acquisition conditions like UFPR-Periocular.

Abstract: We focus on ocular biometrics, specifically the periocular region (the area around the eye), which offers high discrimination and minimal acquisition constraints. We evaluate three Convolutional Neural Network architectures of varying depth and complexity to assess their effectiveness for periocular recognition. The networks are trained on 1,907,572 ocular crops extracted from the large-scale VGGFace2 database. This significantly contrasts with existing works, which typically rely on small-scale periocular training datasets of only a few thousand images. Experiments are conducted with ocular images from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images, and the UFPR-Periocular database, which consists of selfies captured via mobile devices with user guidance on the screen. Due to the uncontrolled conditions of VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from 9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In contrast, UFPR-Periocular yields significantly better performance (EERs of 1-2%), thanks to higher image quality and more consistent acquisition protocols. To the best of our knowledge, these are the lowest reported EERs on the UFPR dataset to date.
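
The reported EERs follow the standard verification protocol; a compact way to compute an Equal Error Rate from genuine/impostor scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false accept rate (FPR)
    equals the false reject rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2
```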

[149] Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology

Luting Wang, Yinghao Xiang, Hongliang Huang, Dongjun Li, Chen Gao, Si Liu

Main category: cs.CV

TL;DR: A unified framework with AEOS-Bench benchmark suite and AEOS-Former Transformer model for scheduling Agile Earth Observation Satellites constellations, outperforming baselines in task completion and energy efficiency.

Motivation: Existing methods simplify the complexities of AEOS scheduling under large-scale scenarios, dynamic environments, and stringent constraints, limiting real-world performance.

Method: AEOS-Former uses Transformer-based scheduling with constraint-aware attention and dedicated internal constraint module to model satellite physical/operational limits, trained via simulation-based iterative learning.

Result: AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies validating component contributions.

Conclusion: The framework provides robust AEOS constellation scheduling solution, with AEOS-Bench being the first large-scale benchmark suite for realistic constellation scheduling.

Abstract: Agile Earth Observation Satellites (AEOSs) constellations offer unprecedented flexibility for monitoring the Earth’s surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains $3,907$ finely tuned satellite assets and $16,410$ scenarios. Each scenario features $1$ to $50$ satellites and $50$ to $300$ imaging tasks. These scenarios are generated via a high-fidelity simulation platform, ensuring realistic satellite behavior such as orbital dynamics and resource constraints. Ground truth scheduling annotations are provided for each scenario. To our knowledge, AEOS-Bench is the first large-scale benchmark suite tailored for realistic constellation scheduling. Building upon this benchmark, we introduce AEOS-Former, a Transformer-based scheduling model that incorporates a constraint-aware attention mechanism. A dedicated internal constraint module explicitly models the physical and operational limits of each satellite. Through simulation-based iterative learning, AEOS-Former adapts to diverse scenarios, offering a robust solution for AEOS constellation scheduling. Experimental results demonstrate that AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies highlighting the contribution of each component. Code and data are provided in https://github.com/buaa-colalab/AEOSBench.
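
Constraint-aware attention typically amounts to masking infeasible (satellite, task) pairs before the softmax. A generic sketch follows, with the feasibility matrix assumed to be precomputed from orbital-visibility and resource checks; this is not AEOS-Former's exact module.

```python
import torch
import torch.nn.functional as F

def constraint_aware_attention(q, k, v, feasible):
    """Scaled dot-product attention with infeasible (query, key) pairs
    masked; `feasible` is a boolean (B, Lq, Lk) matrix. Assumes every
    query has at least one feasible key."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~feasible, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```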

[150] Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG

Jelizaveta Jankowska, Bożena Kostek, Fernando Alonso-Fernandez, Prayag Tiwari

Main category: cs.CV

TL;DR: Study examines how different music genres affect human emotions using EEG measurements and subjective surveys, revealing connections between emotions and brain activity.

Motivation: To demonstrate the impact of different music genres on human emotions and understand the relationship between emotional responses and brain activity.

Method: Used EEG helmet to measure brain activity while participants listened to different music types, combined with subjective surveys from a diverse group of participants with varying gender and musical preferences.

Result: Analysis revealed connections between emotions and observed brain activity, capturing a wide range of emotional responses to different music genres.

Conclusion: Different types of music significantly affect human emotions, and these emotional responses are reflected in measurable brain activity patterns.

Abstract: The subject of this work is to examine how different types of music affect human emotions. While participants listened to music, a subjective survey and brain-activity measurements were carried out using an EEG helmet. The aim is to demonstrate the impact of different music genres on emotions. The research involved a diverse group of participants of different genders and musical preferences. This made it possible to capture a wide range of emotional responses to music. After the experiment, an analysis relating the respondents' questionnaires to the EEG signals was performed. The analysis revealed connections between emotions and observed brain activity.

[151] A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading

Junlai Qiu, Yunzhu Chen, Hao Zheng, Yawen Huang, Yuexiang Li

Main category: cs.CV

TL;DR: Proposes an evidential fusion paradigm to combine CNN and ViT backbones for diabetic retinopathy diagnosis, leveraging local and global features respectively to overcome limitations of single-type backbones.

Motivation: Existing DR diagnosis systems using single-type backbones (CNN or ViT) have performance bottlenecks due to their inherent limitations. CNN excels at local feature extraction while ViT captures global features, suggesting that integrating both could improve performance.

Method: Proposes an evidential fusion paradigm that transforms features from different backbones into supporting evidences via deep evidential networks. This allows adaptive fusion patterns between CNN and ViT backbones based on aggregated opinions.

Result: Experimental results on two publicly available DR grading datasets show improved accuracy compared to state-of-the-art frameworks, with excellent interpretability for feature fusion and decision-making.

Conclusion: The evidential fusion approach effectively combines CNN and ViT strengths, breaking through performance bottlenecks of single-type backbones while providing interpretable feature fusion for diabetic retinopathy diagnosis.

Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, which significantly impacts their daily lives and mental health. To improve the efficiency of clinical screening and enable the early detection of DR, a variety of automated DR diagnosis systems have been recently established based on convolutional neural networks (CNNs) or vision Transformers (ViTs). However, due to the inherent shortcomings of CNNs / ViTs, the performance of existing methods using a single-type backbone has reached a bottleneck. One potential route to further improvement is to integrate different kinds of backbones, fully leveraging their respective strengths (i.e., the local feature extraction capability of CNNs and the global feature capturing ability of ViTs). To this end, we propose a novel paradigm to effectively fuse the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidences via a set of deep evidential networks. With the supporting evidences, the aggregated opinion can be accordingly formed, which can be used to adaptively tune the fusion pattern between different backbones and accordingly boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading, compared to the state-of-the-art frameworks, but also provides excellent interpretability for feature fusion and decision-making.
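
Evidential deep learning usually maps logits to Dirichlet parameters and pools the resulting evidences. The sketch below shows a naive sum-pooling variant for intuition; the paper's adaptive fusion rule is more elaborate, and both function names are assumptions.

```python
import torch
import torch.nn.functional as F

def to_dirichlet(logits):
    """Map backbone logits to Dirichlet parameters (evidence + 1)."""
    return F.softplus(logits) + 1.0

def fuse_opinions(alpha_cnn, alpha_vit):
    """Naive evidence pooling: sum the evidences of both backbones."""
    alpha = (alpha_cnn - 1) + (alpha_vit - 1) + 1
    probs = alpha / alpha.sum(dim=-1, keepdim=True)     # expected probs
    uncertainty = alpha.size(-1) / alpha.sum(dim=-1)    # Dirichlet vacuity
    return probs, uncertainty
```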

[152] GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?

Mingyu Sung, Seungjae Ham, Kangwoo Kim, Yeokyoung Yoon, Sangseok Yun, Il-Min Kim, Jae-Mo Kang

Main category: cs.CV

TL;DR: GLYPH-SR is a vision-language-guided diffusion framework for image super-resolution that specifically optimizes for both text legibility and perceptual quality in complex natural scenes, addressing the limitations of previous SR methods that treat scene-text as generic texture.

Motivation: Scene-text in natural images carries crucial actionable information, but current super-resolution methods fail to preserve character-level details, causing OCR failures even when the rest of the image appears sharp. Traditional SR metrics are insensitive to text legibility issues.

Method: GLYPH-SR uses a Text-SR Fusion ControlNet guided by OCR data and a ping-pong scheduler that alternates between text- and scene-centric guidance. The framework is trained on synthetic corpus while keeping the main SR branch frozen.

Result: Across SVT, SCUT-CTW1500, and CUTE80 datasets at x4 and x8 scaling, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines while maintaining competitive perceptual quality metrics (MANIQA, CLIP-IQA, MUSIQ).

Conclusion: GLYPH-SR successfully achieves both high text readability and high visual realism simultaneously, delivering super-resolution that both looks right and reads right for practical vision applications.

Abstract: Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously, high readability and high visual realism, delivering SR that looks right and reads right.

[153] EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models

Igor Abramov, Ilya Makarov

Main category: cs.CV

TL;DR: A dual-conditioning framework combining EEG embeddings with spatial saliency maps for enhanced EEG-driven image reconstruction, using ATM for EEG feature extraction and fine-tuning Stable Diffusion 2.1 with LoRA.

Motivation: Existing EEG-driven image reconstruction methods overlook spatial attention mechanisms, limiting fidelity and semantic coherence in generated images.

Method: Proposes a dual-conditioning framework with EEG embeddings and spatial saliency maps, using Adaptive Thinking Mapper (ATM) for EEG feature extraction, fine-tuning Stable Diffusion 2.1 via LoRA, and ControlNet for spatial control.

Result: Achieves significant improvement in quality of low- and high-level image features on THINGS-EEG dataset, strongly aligning with human visual attention.

Conclusion: Attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.

Abstract: Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches, while aligning strongly with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
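
Assembling such a pipeline with the `diffusers` library might look roughly like the sketch below. All checkpoint paths, the LoRA file, and the EEG-embedding shapes are placeholders/assumptions; only the general API pattern is standard.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# All paths below are hypothetical placeholders, not the paper's artifacts.
controlnet = ControlNetModel.from_pretrained(
    "path/to/sd21-saliency-controlnet", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("path/to/eeg_lora")  # assumed LoRA checkpoint

# EEG-derived prompt embeddings (SD 2.1 uses 77 x 1024 text embeddings).
eeg_embeds = torch.randn(1, 77, 1024, dtype=torch.float16, device="cuda")
saliency = torch.rand(1, 3, 768, 768, dtype=torch.float16, device="cuda")
image = pipe(prompt_embeds=eeg_embeds, image=saliency).images[0]
```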

[154] LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation

Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

Main category: cs.CV

TL;DR: LoCoT2V-Bench is a new benchmark for evaluating long video generation under complex prompts, addressing gaps in existing evaluation methods by focusing on fine-grained alignment, narrative coherence, and thematic expression.

Motivation: Existing text-to-video benchmarks use simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with complex prompts and abstract dimensions like narrative coherence and thematic expression.

Method: Proposed LoCoT2V-Bench with realistic complex prompts based on real-world videos, and a multi-dimensional evaluation framework including new metrics like event-level alignment, temporal consistency, content clarity, and HERD (Human Expectation Realization Degree).

Result: Evaluation of nine LVG models showed current methods perform well on basic visual/temporal aspects but struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence.

Conclusion: LoCoT2V-Bench provides a comprehensive platform for evaluating long-form complex text-to-video generation and highlights critical improvement directions for future methods.

Abstract: Recently, text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.

[155] A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models

Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan

Main category: cs.CV

TL;DR: A-TPT is a test-time prompt tuning framework that improves vision-language model calibration by maximizing angular diversity between class-wise textual features on the unit hypersphere.

DetailsMotivation: Current TPT methods lack optimal angular separation between class-wise textual features, which hurts calibration performance and raises concerns about VLMs' reliability and safety.

Method: Introduces angular diversity by maximizing the minimum pairwise angular distance between normalized textual features on the unit hypersphere, encouraging uniformity in feature distribution.

Result: Consistently surpasses state-of-the-art TPT methods in reducing aggregate average calibration error while maintaining comparable accuracy, with superior zero-shot calibration on distribution shifts and medical datasets.

Conclusion: Promoting angular diversity achieves well-dispersed textual features and significantly improves VLM calibration during test-time adaptation.

Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs’ reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
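As a concrete illustration of the objective described above, the following PyTorch-style sketch maximizes the minimum pairwise angular distance between normalized class-wise text features on the unit hypersphere. The function name and exact loss form are our own assumptions rather than the paper's released code.

```python
import torch
import torch.nn.functional as F

def angular_diversity_loss(text_features: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of A-TPT's angular diversity idea: push the smallest
    pairwise angle between class-wise textual features as wide as possible.

    text_features: (num_classes, dim) prompt-induced text embeddings.
    """
    z = F.normalize(text_features, dim=-1)                 # project onto unit hypersphere
    cos = z @ z.t()                                        # pairwise cosine similarities
    angles = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))   # pairwise angles in radians
    # Mask the diagonal: a feature's angle with itself is 0 and must not count.
    mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    min_angle = angles.masked_fill(mask, float("inf")).min()
    return -min_angle  # minimizing the loss maximizes the minimum separation
```

Minimizing this term alongside the usual TPT entropy objective would encourage uniformly dispersed text features, which is the calibration mechanism the abstract credits.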

[156] PointSt3R: Point Tracking through 3D Grounded Correspondence

Rhodri Guerrier, Adam W. Harley, Dima Damen

Main category: cs.CV

TL;DR: The paper adapts foundational 3D reconstruction models (DUSt3R and MASt3R) for point tracking by combining reconstruction loss with dynamic correspondence training and a visibility head, achieving competitive results on multiple datasets.

DetailsMotivation: To leverage recent advances in 3D reconstruction models for point tracking tasks, particularly focusing on improving performance for both static and dynamic point correspondence.

Method: Fine-tune MASt3R for point tracking using synthetic data, combine reconstruction loss with dynamic correspondence training, add a visibility head, and train/evaluate on frame pairs without temporal context.

Result: Achieves competitive or superior performance on four datasets, e.g. 73.8 δ_avg / 85.8% occlusion accuracy on TAP-Vid-DAVIS, and significantly outperforms CoTracker3 on EgoPoints (61.3 vs 54.2) and RGB-S (87.0 vs 82.8).

Conclusion: The adapted 3D reconstruction models can effectively handle point tracking tasks, demonstrating strong performance on both static and dynamic point correspondence across multiple benchmarks.

Abstract: Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and fine-tune MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8% occlusion acc. for PointSt3R compared to 75.7 / 88.3% for CoTracker2; and significantly outperforming CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.
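For reference, the δ_avg figures quoted above follow the standard TAP-Vid protocol: position accuracy averaged over pixel thresholds of 1, 2, 4, 8, and 16. Below is a simplified NumPy sketch of that metric, our reconstruction rather than the authors' evaluation code.

```python
import numpy as np

def delta_avg(pred: np.ndarray, gt: np.ndarray, visible: np.ndarray) -> float:
    """Simplified TAP-Vid-style delta_avg: for each threshold, take the
    fraction of visible points predicted within that many pixels of the
    ground truth, then average the fractions across thresholds.

    pred, gt: (N, 2) pixel coordinates; visible: (N,) boolean mask.
    """
    err = np.linalg.norm(pred - gt, axis=-1)[visible]   # per-point pixel error
    fractions = [(err < t).mean() for t in (1, 2, 4, 8, 16)]
    return 100.0 * float(np.mean(fractions))            # reported as a percentage, e.g. 73.8
```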

[157] Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection

Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang

Main category: cs.CV

TL;DR: The paper proposes FineGrainedAD, a novel framework for few-shot anomaly detection that addresses semantic misalignment between image descriptions and patch-level visual anomalies through multi-level fine-grained semantic captioning and alignment mechanisms.

DetailsMotivation: Existing few-shot anomaly detection methods rely on pre-trained vision-language models but suffer from semantic misalignment due to lack of detailed textual descriptions, leading to sub-optimal localization performance.

Method: Proposes Multi-Level Fine-Grained Semantic Caption (MFSC) for automatic construction of multi-level textual descriptions, Multi-Level Learnable Prompt (MLLP) for fine-grained semantics integration, and Multi-Level Semantic Alignment (MLSA) with region aggregation and alignment training.

Result: Experiments show superior overall performance in few-shot settings on MVTec-AD and VisA datasets compared to existing methods.

Conclusion: The proposed FineGrainedAD framework effectively addresses semantic misalignment in few-shot anomaly detection through multi-level fine-grained semantic captioning and alignment, achieving improved localization performance.

Abstract: Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token, which leads to semantic misalignment between image descriptions and patch-level visual anomalies and thus sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level, fine-grained textual descriptions for existing anomaly detection datasets via an automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training to help learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.

[158] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang

Main category: cs.CV

TL;DR: A causal inference approach that addresses object-context shortcuts in vision-language models by synthesizing counterfactual embeddings and estimating Total Direct Effect to improve zero-shot reliability without retraining.

DetailsMotivation: Object-context shortcuts undermine zero-shot reliability in vision-language models when test scenes differ from training co-occurrences, creating biased predictions.

Method: Estimate object and background expectations in CLIP’s representation space, synthesize counterfactual embeddings by recombining object features with alternative contexts, and use Total Direct Effect estimation to subtract background-only activation.

Result: Substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing new zero-shot state-of-the-art performance without retraining or prompt design.

Conclusion: Provides a lightweight representation-level counterfactual framework for debiased and reliable multimodal reasoning through causal inference.

Abstract: Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP’s representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
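The abstract's intervention can be pictured with a short sketch: scores for counterfactual embeddings are averaged over alternative contexts and the background-only activation is subtracted, approximating the Total Direct Effect. All tensor shapes and the recombination rule below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def debiased_scores(object_emb: torch.Tensor,
                    context_embs: torch.Tensor,
                    text_embs: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of representation-level counterfactual calibration.

    object_emb: (d,) object feature in CLIP space.
    context_embs: (K, d) alternative background/context features.
    text_embs: (C, d) normalized class text embeddings.
    """
    # Counterfactual embeddings: the same object placed into K other contexts
    # (additive recombination is an assumption for illustration only).
    cf = F.normalize(object_emb.unsqueeze(0) + context_embs, dim=-1)   # (K, d)
    cf_scores = (cf @ text_embs.t()).mean(dim=0)                       # (C,) avg over contexts
    # Background-only activation: what the contexts alone would score.
    bg_scores = (F.normalize(context_embs, dim=-1) @ text_embs.t()).mean(dim=0)
    # Subtracting the background term approximates the Total Direct Effect.
    return cf_scores - bg_scores
```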

[159] DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications

Malaisree P, Youwai S, Kitkobsin T, Janrungautai S, Amorndechaphon D, Rojanavasu P

Main category: cs.CV

TL;DR: DINO-YOLO is a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient object detection in civil engineering applications with limited annotated data.

DetailsMotivation: Object detection in civil engineering is constrained by limited annotated data in specialized domains, requiring data-efficient solutions that can work with small datasets (<10K images).

Method: Hybrid architecture integrating DINOv3 self-supervised vision transformers with YOLOv12 at two strategic locations: input preprocessing (P0) and mid-backbone enhancement (P3). Systematic ablation studies across five YOLO scales and nine DINOv3 variants.

Result: Substantial improvements across datasets: Tunnel Segment Crack detection (12.4% improvement), Construction PPE (13.7% gain), and KITTI (88.6% improvement) while maintaining real-time inference (30-47 FPS). Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5).

Conclusion: DINO-YOLO establishes state-of-the-art performance for civil engineering datasets with limited data while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.

Abstract: Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.

[160] Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing

Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CV

TL;DR: The paper identifies a Matthew effect in self-improvement of LVLMs where models excel at simple queries but struggle with complex ones, leading to imbalanced optimization. It proposes distribution-reshaping and trajectory-resampling strategies to re-balance head-tail data during self-improvement.

DetailsMotivation: Current self-improvement paradigms for LVLMs create an imbalance where models prioritize simple reasoning skills over complex ones, leading to performance bottlenecks due to the Matthew effect - where the rich get richer and poor get poorer in terms of learning capability.

Method: Proposes four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process.

Result: Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models show consistent improvement in visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average across visual reasoning tasks.

Conclusion: The proposed head-tail re-balancing strategies effectively counteract the Matthew effect in self-improvement, enabling more balanced optimization and overcoming performance bottlenecks in LVLMs.

Abstract: Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced–a dynamic we term the “Matthew effect”–which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
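One plausible instance of the distribution-reshaping idea can be written as a hedged sketch: resample queries for the next self-improvement round with weights that grow as the model's current success rate falls, so tail data is not crowded out by head data. The paper proposes four strategies; this weighting rule is our illustration, not its exact recipe.

```python
import numpy as np

def rebalance_weights(success_rates: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Hypothetical head-tail re-balancing: sampling probabilities rise with
    difficulty (1 - current success rate), counteracting the Matthew effect
    where easy queries dominate the self-improvement data.
    """
    difficulty = 1.0 - success_rates                 # low success -> high difficulty
    logits = difficulty / max(temperature, 1e-8)     # temperature controls sharpness
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    return probs / probs.sum()

# Usage: queries with a 10% solve rate get sampled far more often than
# queries the model already solves 95% of the time.
weights = rebalance_weights(np.array([0.95, 0.60, 0.10]))
```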

[161] Analysis of the Robustness of an Edge Detector Based on Cellular Automata Optimized by Particle Swarm

Vinícius Ferraria, Eurico Ruivo

Main category: cs.CV

TL;DR: An adaptable edge detector using cellular automata optimized by meta-heuristics with transfer learning was developed to address weaknesses in detecting loose edges and lack of context. The study found that expanding the optimization search space was ineffective and transfer learning provided no significant improvements, though the model could adapt to inputs.

DetailsMotivation: To address weaknesses in edge detection such as difficulty detecting loose edges and lack of context for extracting relevant information from specific problems, and to create an adaptable detector that can adjust to image properties.

Method: Developed an adaptable edge detector using two-dimensional cellular automaton optimized by meta-heuristic combined with transfer learning techniques. Analyzed impact of expanding search space in optimization phase and robustness of adaptability on natural images and specialized subsets.

Result: Expanding the search space in optimization phase was not effective for the chosen image set. The model was able to adapt to input regardless of validation, but transfer learning techniques showed no significant improvements.

Conclusion: The adaptable edge detector using cellular automata and meta-heuristic optimization can adapt to inputs, but expanding the optimization search space was ineffective and transfer learning did not provide meaningful enhancements for the tested image set.

Abstract: The edge detection task is essential in image processing, aiming to extract relevant information from an image. One recurring problem in this task is the weaknesses found in some detectors, such as difficulty in detecting loose edges and a lack of context for extracting relevant information from specific problems. To address these weaknesses and adapt the detector to the properties of an image, an adaptable detector described by a two-dimensional cellular automaton, optimized by a meta-heuristic, and combined with transfer learning techniques was developed. This study analyzes the impact of expanding the search space of the optimization phase and the robustness of the detector's adaptability in identifying edges in a set of natural images and in specialized subsets extracted from the same image set. The results show that expanding the search space of the optimization phase was not effective for the chosen image set. The study also analyzed the adaptability of the model through a series of experiments and validation techniques and found that, regardless of the validation, the model was able to adapt to the input, while the transfer learning techniques applied to the model showed no significant improvements.

[162] SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging

Hao Xie, Zixun Huang, Yushen Zuo, Yakun Ju, Frank H. F. Leung, N. F. Law, Kin-Man Lam, Yong-Ping Zheng, Sai Ho Ling

Main category: cs.CV

TL;DR: SA²Net is a novel scale-adaptive structure-aware network for spine segmentation in ultrasound volume projection imaging that addresses challenges in learning global contextual knowledge and encoding structural bone features through cross-dimensional correlation learning and structure-affinity transformation.

DetailsMotivation: Spine segmentation from ultrasound VPI is crucial for intelligent scoliosis diagnosis but faces challenges: global contextual knowledge may not be well-learned due to neglected spatial correlation of bone features, and rich structural knowledge about bone shapes and positions needs to be encoded into segmentation.

Method: Proposed SA²Net with: 1) scale-adaptive complementary strategy to learn cross-dimensional long-distance correlation features, 2) structure-affinity transformation that transforms semantic features with class-specific affinity combined with Transformer decoder for structure-aware reasoning, and 3) feature mixing loss aggregation for enhanced training.

Result: SA²Net achieves superior segmentation performance compared to other state-of-the-art methods, demonstrating improved robustness and accuracy in spine segmentation.

Conclusion: SA²Net shows strong potential as a promising tool for advanced scoliosis diagnosis through intelligent spinal image analysis, with adaptability to various backbones enhancing its practical applicability.

Abstract: Spine segmentation, based on ultrasound volume projection imaging (VPI), plays a vital role for intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. Firstly, the global contextual knowledge of spines may not be well-learned if we neglect the high spatial correlation of different bone features. Secondly, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic level affinity, we propose structure-affinity transformation to transform semantic features with class-specific affinity and combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training. This method improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA$^{2}$Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA$^{2}$Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis. The code and experimental demo are available at https://github.com/taetiseo09/SA2Net.

[163] Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios

Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi

Main category: cs.CV

TL;DR: A Dynamic Context-Aware Scene Reasoning framework that uses Vision-Language Alignment for zero-shot scene understanding in unfamiliar environments without labeled data.

DetailsMotivation: AI systems struggle with unfamiliar scenarios without labeled data, limiting deployment in dynamic real-world environments. Conventional models cannot generalize across unseen contexts.

Method: Integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions. Uses a dynamic reasoning module that combines global scene cues and object-level interactions guided by linguistic priors.

Result: Achieves up to 18% improvement in scene understanding accuracy on zero-shot benchmarks (COCO, Visual Genome, Open Images) over baseline models. Shows robust performance in ambiguous or cluttered scenes.

Conclusion: The framework provides a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings through synergistic vision-language fusion.

Abstract: In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.

[164] CATCH: A Modular Cross-domain Adaptive Template with Hook

Xinjin Li, Yulie Lu, Jinghan Cao, Yu Ma, Zhenglin Li, Yeyang Zhou

Main category: cs.CV

TL;DR: CATCH is a plug-and-play framework for cross-domain VQA adaptation that uses lightweight domain classification and dual adapters to improve generalization without retraining backbone models.

DetailsMotivation: VQA models like LLaVA perform well on natural images but degrade significantly in out-of-domain scenarios (remote sensing, medical imaging, math diagrams) due to distribution shifts and lack of effective domain adaptation mechanisms.

Method: Decouples visual and linguistic adaptation using two lightweight modules: domain classifier for input image type identification, and dual adapter mechanism with Prompt Adapter for language modulation and Visual Adapter for vision feature adjustment, dynamically injected via unified hook interface.

Result: Achieves consistent performance gains across four domain-specific VQA benchmarks: +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA, without retraining backbone model.

Conclusion: CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains with minimal architectural changes.

Abstract: Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
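The "unified hook interface" maps naturally onto PyTorch forward hooks, which can replace a frozen layer's output without touching its weights. The adapter design below is a generic bottleneck sketch under that assumption, not CATCH's released module.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Lightweight residual bottleneck adapter; a sketch, not CATCH's exact module."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))   # small residual adjustment

def inject_adapter(layer: nn.Module, adapter: nn.Module):
    """Attach an adapter to a frozen layer via a forward hook (assumed
    mechanism): returning a value from the hook replaces the layer's output,
    so the backbone itself needs no retraining or code changes."""
    def hook(module, inputs, output):
        return adapter(output)                          # adapter-modulated features
    return layer.register_forward_hook(hook)           # handle.remove() detaches it
```

A domain classifier could then decide, per input image, which adapter pair to inject, which matches the plug-and-play behavior the abstract describes.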

[165] Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang

Main category: cs.CV

TL;DR: Emu3.5 is a large-scale multimodal world model that predicts next states across vision and language using unified next-token prediction, achieving strong performance in multimodal generation and world modeling with improved inference efficiency.

DetailsMotivation: To create a unified multimodal world model that can handle interleaved vision-language inputs and outputs, enabling complex multimodal reasoning and generation tasks.

Method: Pre-trained end-to-end with unified next-token prediction on 10+ trillion tokens from internet videos, then post-trained with large-scale reinforcement learning. Uses Discrete Diffusion Adaptation (DiDA) for 20x faster inference.

Result: Achieves performance comparable to Gemini 2.5 Flash Image on image tasks, superior results on interleaved generation tasks, and demonstrates spatiotemporally consistent world exploration and open-world embodied manipulation.

Conclusion: Emu3.5 represents a significant advancement in multimodal world modeling with strong native capabilities and efficient inference, making it suitable for diverse real-world applications.

Abstract: We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.

[166] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching

Anirban Ray, Vera Galinova, Florian Jug

Main category: cs.CV

TL;DR: ResMatching is a novel computational super-resolution method using guided conditional flow matching to learn improved data priors, achieving the best trade-off between data fidelity and perceptual realism on biological structures.

DetailsMotivation: Computational super-resolution in fluorescence microscopy is an ill-posed problem that requires strong priors to extrapolate missing frequencies. With better data-driven machine learning techniques, stronger priors can be learned to improve CSR results.

Method: ResMatching uses guided conditional flow matching to learn improved data priors for computational super-resolution. It can sample from an implicitly learned posterior distribution and provides pixel-wise uncertainty estimates.

Result: ResMatching achieves competitive results on 4 diverse biological structures from BioSR dataset against 7 baselines, demonstrating the best trade-off between data fidelity and perceptual realism in all cases. It’s particularly effective when low-resolution images contain noise.

Conclusion: ResMatching provides calibrated posterior distributions enabling pixel-wise uncertainty estimation, allowing users to reject uncertain predictions. The method shows strong performance especially in challenging scenarios where learning strong priors is difficult.

Abstract: Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
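For readers unfamiliar with the underlying machinery, a minimal conditional flow matching objective looks as follows. This is the standard formulation with the low-resolution image as conditioning, not ResMatching's exact training code; the velocity-network signature is an assumption.

```python
import torch

def conditional_flow_matching_loss(v_theta, x_hr: torch.Tensor, cond_lr: torch.Tensor):
    """Standard conditional flow matching loss (generic sketch).

    v_theta: velocity network taking (x_t, t, condition).
    x_hr: high-resolution targets; cond_lr: low-resolution conditioning images.
    """
    x0 = torch.randn_like(x_hr)                          # noise endpoint of the path
    t = torch.rand(x_hr.shape[0], device=x_hr.device)    # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x_hr.dim() - 1)))           # broadcastable time
    x_t = (1 - t_) * x0 + t_ * x_hr                      # linear interpolation path
    target_v = x_hr - x0                                 # constant velocity along the path
    pred_v = v_theta(x_t, t, cond_lr)
    return ((pred_v - target_v) ** 2).mean()
```

Sampling then integrates the learned velocity field from noise, conditioned on the low-resolution input; repeating the integration with different noise draws yields the posterior samples the abstract uses for pixel-wise uncertainty.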

[167] Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras

Christoffer Koo Øhrstrøm, Ronja Güldenring, Lazaros Nalpantidis

Main category: cs.CV

TL;DR: Spiking Patches tokenizer preserves event camera properties (asynchronous, sparse) while matching/surpassing frame/voxel methods in accuracy with 3.4-10.4x faster inference.

DetailsMotivation: Existing event representations (frames, voxels) lose the asynchronous and spatially sparse properties of event cameras, which are their key advantages.

Method: Propose Spiking Patches tokenizer that converts event streams into tokens preserving asynchronous and sparse properties, evaluated with GNN, PCN, and Transformer on gesture recognition and object detection.

Result: 3.4x faster than voxels, 10.4x faster than frames while matching accuracy; absolute improvements up to 3.8% for gesture recognition and 1.4% for object detection.

Conclusion: Tokenization is a novel direction for event-based vision that preserves event camera properties without sacrificing accuracy, enabling faster inference.

Abstract: We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
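A plausible reading of an asynchronous, sparse patch tokenizer is sketched below: events accumulate per spatial patch, and a token is emitted only when a patch "spikes", so quiet regions produce no tokens at all. Patch size, threshold, and the spiking rule are our assumptions; the published Spiking Patches design may differ in detail.

```python
from collections import defaultdict

def spiking_patch_tokens(events, patch: int = 16, threshold: int = 32):
    """Hypothetical asynchronous, sparse tokenizer for event streams.

    events: time-ordered iterable of (t, x, y, polarity) tuples.
    Yields (t, patch_x, patch_y, buffered_events) tokens as soon as a patch
    accumulates `threshold` events -- no fixed frames, no dense voxel grid.
    """
    buffers = defaultdict(list)
    for t, x, y, p in events:
        key = (x // patch, y // patch)
        buffers[key].append((t, x, y, p))
        if len(buffers[key]) >= threshold:       # patch "spikes"
            yield (t, key[0], key[1], buffers.pop(key))
```

Because tokens appear only where and when activity occurs, downstream models (GNN, PCN, Transformer) process far fewer elements than with frames or voxels, which is consistent with the reported 3.4-10.4x inference speedups.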

[168] PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus

Bingcong Huo, Zhiming Wang

Main category: cs.CV

TL;DR: PT-DETR improves RT-DETR for UAV small object detection with PADF and MFFF modules, achieving 1.6-1.7% mAP gains on VisDrone2019 with lower complexity.

DetailsMotivation: Address challenges in UAV object detection including complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions.

Method: Introduces Partially-Aware Detail Focus (PADF) Module for enhanced small object feature extraction, Median-Frequency Feature Fusion (MFFF) module for better detail capture, and Focaler-SIoU for improved bounding box matching.

Result: Achieves mAP improvements of 1.6% and 1.7% on VisDrone2019 dataset compared to RT-DETR, with lower computational complexity and fewer parameters.

Conclusion: PT-DETR demonstrates robustness and feasibility for small-object detection tasks in UAV imagery.

Abstract: To address the challenges in UAV object detection, such as complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions, this paper proposes PT-DETR, a novel detection algorithm based on RT-DETR and specifically designed for small objects in UAV imagery. In the backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module to enhance feature extraction for small objects. Additionally, we design the Median-Frequency Feature Fusion (MFFF) module, which effectively improves the model’s ability to capture small-object details and contextual information. Furthermore, we incorporate Focaler-SIoU to strengthen the model’s bounding box matching capability and increase its sensitivity to small-object features, thereby further enhancing detection accuracy and robustness. Compared with RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the VisDrone2019 dataset with lower computational complexity and fewer parameters, demonstrating its robustness and feasibility for small-object detection tasks.

[169] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi

Main category: cs.CV

TL;DR: This survey provides a forward-looking analysis of object detection in Autonomous Vehicles, focusing on emerging AI paradigms like Vision-Language Models, Large Language Models, and Generative AI rather than outdated techniques.

DetailsMotivation: To bridge the gap in fragmented knowledge across multimodal perception, contextual reasoning, and cooperative intelligence in autonomous vehicle object detection, and provide a comprehensive roadmap for future developments.

Method: Systematically reviews AV sensors and fusion strategies, introduces structured categorization of AV datasets (ego-vehicle, infrastructure-based, cooperative), and analyzes cutting-edge detection methodologies including 2D/3D pipelines, hybrid sensor fusion, and transformer-driven approaches.

Result: The survey synthesizes current capabilities in object detection for AVs, highlighting the potential integration of traditional sensor systems with LLM/VLM-driven perception frameworks.

Conclusion: Delivers a clear roadmap of current capabilities, open challenges, and future opportunities in autonomous vehicle object detection, emphasizing the transformative potential of emerging AI technologies.

Abstract: Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.

[170] Towards Reliable Sea Ice Drift Estimation in the Arctic Deep Learning Optical Flow on RADARSAT-2

Daniela Martin, Joseph Gallego

Main category: cs.CV

TL;DR: Deep learning optical flow models achieve sub-kilometer accuracy for sea ice drift estimation from SAR imagery, outperforming classical methods and providing spatially continuous motion fields.

DetailsMotivation: Accurate sea ice drift estimation is critical for Arctic navigation and climate research, but existing methods using classical optical flow have limitations in complex scenarios. Deep learning approaches have shown superior performance in computer vision, motivating their application to satellite SAR imagery for sea ice monitoring.

Method: Conducted the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery, evaluated using endpoint error (EPE) and Fl-all metrics against GNSS-tracked buoys as ground truth.

Result: Several models achieved sub-kilometer accuracy (EPE 6-8 pixels, 300-400m), which is small relative to spatial scales of sea ice motion. Models captured consistent regional drift patterns and demonstrated substantial improvement over classical optical flow methods.

Conclusion: Deep learning optical flow methods can be effectively transferred to polar remote sensing, providing spatially continuous drift fields that offer new opportunities for Arctic navigation and climate modeling.

Abstract: Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel-wise motion between consecutive images, has advanced rapidly in computer vision, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep learning-based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl-all metrics against GNSS-tracked buoys. Several models achieve sub-kilometer accuracy (EPE of 6 to 8 pixels, or 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep learning-based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing. Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.
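The headline accuracy numbers rest on the endpoint error metric, which is simple to state in code. The pixel-to-metre conversion in the comment assumes the nominal pixel spacing implied by the paper's own figures (6-8 px corresponding to 300-400 m suggests roughly 50 m per pixel).

```python
import numpy as np

def endpoint_error(flow_pred: np.ndarray, flow_gt: np.ndarray) -> float:
    """Average endpoint error (EPE): mean Euclidean distance between
    predicted and reference displacement vectors, in pixels.

    flow_pred, flow_gt: (..., 2) arrays of (dx, dy) displacements.
    With ~50 m ScanSAR pixels, an EPE of 6-8 px is on the order of 300-400 m.
    """
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())
```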

[171] Improving Classification of Occluded Objects through Scene Context

Courtney M. King, Daniel D. Leeds, Damian Lyons, George Kalaitzis

Main category: cs.CV

TL;DR: The paper presents two scene-based information fusion techniques to improve object detection under occlusion: one selects custom object networks based on scene context before prediction, and another fuses scene knowledge into object scores after detection.

DetailsMotivation: Occlusions pose significant challenges to object recognition algorithms, and scene context can provide valuable additional information to reduce errors caused by occlusions, similar to how biological vision works.

Method: Two distinct scene-based information fusion techniques: (1) pre-prediction method that selects custom object networks based on identified background scene, and (2) post-detection method that fuses scene knowledge into initial object scores from RPN.

Result: The algorithms show overall improvement in both recall and precision against baseline methods on challenging datasets with partial occlusions. Training on a combination of occluded and unoccluded images performs better than other training methodologies.

Conclusion: The method is interpretable, easily adaptable to other datasets, and offers promising future research directions and practical applications for handling occlusions in object detection.

Abstract: The presence of occlusions has provided substantial challenges to typically-powerful object recognition algorithms. Additional sources of information can be extremely valuable to reduce errors caused by occlusions. Scene context is known to aid in object recognition in biological vision. In this work, we attempt to add robustness into existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network to use based on the identified background scene, and the second operates after detection, fusing scene knowledge into initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, which show overall improvement in both recall and precision against baseline methods. In addition, our experiments contrast multiple training methodologies for occlusion handling, finding that training on a combination of both occluded and unoccluded images demonstrates an improvement over the others. Our method is interpretable and can easily be adapted to other datasets, offering many future directions for research and practical applications.

[172] Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill

Vaibhav Kurrey, Sivakalyan Pujari, Gagan Raj Gupta

Main category: cs.CV

TL;DR: A machine vision system using industrial cameras and deep learning predicts equipment failures in steel rolling mills by analyzing live video streams and sensor data, enabling proactive maintenance.

DetailsMotivation: To reduce unplanned breakdown costs and improve operational reliability in industrial manufacturing by enabling early prediction of equipment failures and process interruptions.

Method: Integration of industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time. Live video streams are processed on a centralized video server using deep learning models, with joint analysis of sensor data from data acquisition systems and visual inputs.

Result: The system enables early prediction of equipment failures, identifies location and probable root causes of failures, and provides actionable insights for proactive maintenance while minimizing computational load on industrial process control systems.

Conclusion: This integrated machine vision approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments through scalable deployment and proactive maintenance capabilities.

Abstract: We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.

[173] The Impact and Outlook of 3D Gaussian Splatting

Bernhard Kerbl

Main category: cs.CV

TL;DR: This paper provides a comprehensive overview of the evolution and key research directions that have emerged following the introduction of 3D Gaussian Splatting (3DGS), highlighting advances in efficiency, dynamic representations, mathematical foundations, platform deployment, and reconstruction speed.

DetailsMotivation: To summarize the extensive research landscape that has developed around 3D Gaussian Splatting since its introduction, documenting how it has transformed 3D scene representations and inspired numerous follow-up contributions.

Method: The authors conduct a systematic review and analysis of key research directions that have emerged in the wake of 3DGS, categorizing advances across multiple dimensions including efficiency, dynamic representations, mathematical foundations, and practical applications.

Result: The analysis reveals significant progress in resource-efficient training and rendering, evolution toward 4D dynamic representations, deeper mathematical exploration, mobile/VR platform deployment, massive-scale environment handling, and near-instant radiance field reconstruction.

Conclusion: 3D Gaussian Splatting has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics, with extensive research advancing its capabilities across multiple dimensions.

Abstract: Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.

[174] SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models

Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas

Main category: cs.CV

TL;DR: SteerVLM is a lightweight steering module that guides Vision-Language Models to better follow instructions by dynamically adjusting activations between language and image modalities, requiring only 0.14% of the original model’s parameters.

DetailsMotivation: To enable fine-grained control over VLM outputs during inference without modifying model weights, while preserving performance on other tasks and avoiding manual intervention requirements.

Method: Learns from latent embeddings of paired prompts to dynamically adjust activations connecting language modality with image context, using dimension-wise activation modulation and adaptive steering across layers.

Result: Outperforms existing intervention techniques on steering and hallucination mitigation benchmarks, and introduces VNIA dataset for VLM steering evaluation.

Conclusion: Provides a robust solution for multimodal model control through activation engineering, enabling inference-time control over complex output semantics with minimal parameter overhead.

Abstract: This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. The steering module requires learnable parameters equal to only 0.14% of the original VLM’s size and gains model control through dimension-wise activation modulation and adaptive steering across layers, without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and offers a robust solution for multimodal model control through activation engineering.
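Dimension-wise activation modulation can be pictured as a per-dimension scale and shift predicted from a steering embedding. The module below is a hedged illustration with assumed names and shapes, not SteerVLM's published architecture.

```python
import torch
import torch.nn as nn

class DimensionwiseSteering(nn.Module):
    """Hypothetical sketch: modulate each hidden dimension of a layer's
    activations, conditioned on a learned steering embedding, leaving the
    backbone's weights untouched."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Linear(dim, dim)   # predicts per-dimension gain
        self.shift = nn.Linear(dim, dim)   # predicts per-dimension offset

    def forward(self, hidden: torch.Tensor, steer: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim) activations; steer: (batch, dim) embedding
        s = steer.unsqueeze(1)                                   # broadcast over sequence
        return hidden * (1 + torch.tanh(self.scale(s))) + self.shift(s)
```

Applying such a module adaptively at several layers, with the steering embedding derived from the target/converse prompt pair, would match the inference-time control mechanism the abstract describes.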

[175] ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou

Main category: cs.CV

TL;DR: The paper introduces ChartAlign Benchmark (ChartAB) to evaluate vision-language models’ chart grounding capabilities, including data extraction, element localization, and attribute recognition, revealing significant limitations in current models.

DetailsMotivation: Existing vision-language models lack accurate perception of details and struggle with fine-grained structure extraction from charts, which hinders their ability to compare multiple charts and perform reasoning tasks.

Method: Developed a comprehensive benchmark with JSON templates for evaluation metrics, and incorporated a two-stage inference workflow to assess models’ capability to align and compare elements across multiple charts.

Result: Evaluation of recent VLMs revealed perception biases, weaknesses, robustness issues, and hallucinations in chart understanding, highlighting fine-grained discrepancies among models.

Conclusion: Current VLMs have significant limitations in chart understanding tasks, and the benchmark identifies specific skills that need strengthening in these models.

Abstract: Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel “ChartAlign Benchmark (ChartAB)” to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs’ capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.

[176] HEIR: Learning Graph-Based Motion Hierarchies

Cheng Zheng, William Koch, Baiang Li, Felix Heide

Main category: cs.CV

TL;DR: A general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data using graph-based hierarchies and differentiable graph learning.

Motivation: Existing methods rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting generalizability across different tasks. There's a need for adaptable, data-driven hierarchical modeling.

Method: Represents motions using graph-based hierarchies that decompose global absolute motions into parent-inherited patterns and local motion residuals. Formulates hierarchy inference as a differentiable graph learning problem using graph neural networks.

Result: Successfully reconstructs intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to baseline on dynamic 3D Gaussian splatting scenes.

Conclusion: The method provides an adaptable, data-driven hierarchical modeling paradigm applicable to a broad range of motion-centric tasks across computer vision, graphics, and robotics.

Abstract: Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
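
The parent-inherited/residual decomposition can be illustrated on a fixed tree: each node's residual is its global motion minus its parent's, and summing residuals down the tree reconstructs the input. This NumPy sketch assumes a hand-chosen hierarchy, whereas HEIR learns the parent structure with graph neural networks.

```python
import numpy as np

# Global per-node motion vectors (e.g., 2D translations for 4 elements).
global_motion = np.array([[1.0, 0.0],   # node 0: root
                          [1.2, 0.1],   # node 1: child of 0
                          [1.1, -0.1],  # node 2: child of 0
                          [1.3, 0.2]])  # node 3: child of 1
parent = [-1, 0, 0, 1]  # -1 marks the root

# Residual = own global motion minus the motion inherited from the parent.
residual = np.array([
    m if p < 0 else m - global_motion[p]
    for m, p in zip(global_motion, parent)
])
print(residual)  # small residuals indicate a well-chosen hierarchy

# Reconstruction: accumulate residuals down the tree
# (parents are listed before their children here).
recon = np.zeros_like(global_motion)
for i, p in enumerate(parent):
    recon[i] = residual[i] + (recon[p] if p >= 0 else 0)
assert np.allclose(recon, global_motion)
```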

[177] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu

Main category: cs.CV

TL;DR: A framework that transfers knowledge from video generation to 3D human motion generation across data, modeling, and evaluation pillars, achieving superior generalization performance.

Motivation: Existing 3D human motion generation models face generalization bottlenecks, while video generation has shown remarkable generalization in modeling human behaviors, suggesting transferable insights.

Method: Proposes ViMoGen-228K dataset integrating MoCap data with web videos and synthesized samples; ViMoGen diffusion transformer with gated multimodal conditioning; ViMoGen-light distilled variant; and MBench hierarchical evaluation benchmark.

Result: Significantly outperforms existing approaches in both automatic and human evaluations, demonstrating improved generalization capability.

Conclusion: The framework successfully transfers video generation knowledge to motion generation, substantially expanding semantic diversity and enhancing generalization performance across multiple evaluation dimensions.

Abstract: Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
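
A minimal sketch of gated multimodal conditioning, under the assumption that two conditioning streams are fused by a learned sigmoid gate; ViMoGen's actual gating operates inside a flow-matching diffusion transformer.

```python
import torch
import torch.nn as nn

class GatedConditioning(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, mocap_cond: torch.Tensor, video_cond: torch.Tensor) -> torch.Tensor:
        # Each conditioning stream: (batch, dim). The gate decides, per
        # dimension, how much to trust each prior source.
        g = self.gate(torch.cat([mocap_cond, video_cond], dim=-1))
        return g * mocap_cond + (1 - g) * video_cond

cond = GatedConditioning(512)(torch.randn(4, 512), torch.randn(4, 512))
print(cond.shape)  # torch.Size([4, 512])
```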

[178] Scaling Image Geo-Localization to Continent Level

Philipp Lindenberger, Paul-Edouard Sarlin, Jan Hosang, Matteo Balice, Marc Pollefeys, Simon Lynen, Eduard Trulls

Main category: cs.CV

TL;DR: A hybrid approach for fine-grained image geo-localization at continental scale, combining proxy classification with aerial imagery embeddings to achieve 68% localization within 200m across Europe.

Motivation: Standard image retrieval fails at global scale due to large image volumes and insufficient coverage, while existing methods trade off between coarse global classification and limited cross-view retrieval in small regions.

Method: Uses proxy classification during training to learn rich feature representations encoding location information, then combines learned prototypes with aerial imagery embeddings to handle ground-level data sparsity.

Result: Achieves localization within 200m for more than 68% of queries on a dataset covering large parts of Europe, demonstrating fine-grained retrieval across multiple countries.

Conclusion: The hybrid approach enables scalable, fine-grained geo-localization at continental scale, overcoming limitations of existing methods through learned representations and aerial imagery integration.

Abstract: Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize more than 68% of queries to within 200 m on a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
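
At inference, the idea reduces to nearest-neighbor search against per-cell representations that fuse the learned prototypes with aerial-image embeddings. A minimal NumPy sketch, with additive fusion as an assumption (the paper's combination scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, dim = 10_000, 256

prototypes = rng.standard_normal((n_cells, dim))   # learned per-cell prototypes
aerial_emb = rng.standard_normal((n_cells, dim))   # embeddings of aerial tiles

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Fuse the two sources so sparsely covered cells still have a usable entry.
index = l2_normalize(l2_normalize(prototypes) + l2_normalize(aerial_emb))

query = l2_normalize(rng.standard_normal(dim))     # ground-level image embedding
scores = index @ query
print("top-5 candidate cells:", np.argsort(-scores)[:5])
```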

[179] SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, Ziwei Liu

Main category: cs.CV

TL;DR: SEE4D is a pose-free video-to-4D framework that uses virtual cameras and view-conditional inpainting to synthesize spatiotemporal content from casual videos without 3D supervision.

Motivation: Existing methods require manually annotated camera poses which are labor-intensive and brittle for in-the-wild videos. Warp-then-inpaint approaches entangle camera motion with scene dynamics, complicating modeling and inference.

Method: Replaces trajectory prediction with rendering to fixed virtual cameras, separating camera control from scene modeling. Uses view-conditional video inpainting trained to denoise warped images and inpaint missing regions across viewpoints. Features spatiotemporal autoregressive inference with virtual-camera splines and overlapping windows.

Result: Achieves superior generalization and improved performance on cross-view video generation and sparse reconstruction benchmarks compared to pose- or trajectory-conditioned baselines.

Conclusion: SEE4D advances practical 4D world modeling from casual videos by eliminating need for explicit 3D annotations and providing coherent generation at bounded complexity.

Abstract: Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.

[180] Masked Diffusion Captioning for Visual Feature Learning

Chao Feng, Zihao Wei, Andrew Owens

Main category: cs.CV

TL;DR: Masked diffusion captioning (MDC) learns visual features by reconstructing masked text tokens from images using a diffusion language model, achieving competitive performance with autoregressive and contrastive methods.

Motivation: To develop a visual feature learning method that doesn't depend on token position like autoregressive approaches, reducing the need for auxiliary objectives while maintaining strong performance.

Method: Training an image-conditioned masked diffusion language model where text tokens in image-caption pairs are randomly masked, and a visual-conditioned decoder reconstructs the original text.

Result: Linear probing experiments show the learned visual features are competitive with those from autoregressive and contrastive approaches across various models and datasets.

Conclusion: MDC provides an effective alternative to autoregressive captioning for visual feature learning, with position-independent learning signals and competitive downstream task performance.

Abstract: We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token’s position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
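
The training signal can be sketched in a few lines: mask caption tokens at a per-sample random ratio and apply cross-entropy only at masked positions, so the loss does not depend on a token's position. The toy vocabulary and random logits below stand in for the image-conditioned decoder.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, mask_id, batch, seq_len = 1000, 0, 4, 12
tokens = torch.randint(1, vocab_size, (batch, seq_len))  # toy caption token ids

ratio = 0.25 + 0.75 * torch.rand(batch, 1)    # one masking ratio per caption
mask = torch.rand(batch, seq_len) < ratio     # positions to corrupt
corrupted = tokens.masked_fill(mask, mask_id)

# Stand-in for the image-conditioned decoder: anything mapping the corrupted
# tokens plus visual features to per-position vocabulary logits.
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)

# Cross-entropy only at masked positions: the learning signal is independent
# of where a token sits in the sequence.
loss = F.cross_entropy(logits[mask], tokens[mask])
loss.backward()
print(loss.item(), corrupted[0].tolist())
```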

[181] OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu

Main category: cs.CV

TL;DR: OmniX is a unified framework that repurposes 2D generative models for panoramic perception of geometry, textures, and PBR materials to generate graphics-ready 3D scenes suitable for physically based rendering, relighting, and simulation.

Motivation: To advance panorama-based 2D lifting techniques to generate graphics-ready 3D scenes that support physically based rendering, relighting, and simulation, addressing the limitation of existing approaches that focus on appearance generation while ignoring intrinsic property perception.

Method: Presents OmniX framework with lightweight cross-modal adapter structure that reuses 2D generative priors for panoramic vision tasks including perception, generation, and completion. Constructs large-scale synthetic panorama dataset with high-quality multimodal panoramas from diverse indoor and outdoor scenes.

Result: Extensive experiments demonstrate effectiveness in panoramic visual perception and graphics-ready 3D scene generation, enabling immersive and physically realistic virtual world generation.

Conclusion: OmniX opens new possibilities for generating graphics-ready 3D scenes by leveraging 2D generative priors for panoramic perception of intrinsic properties, advancing beyond appearance-only generation approaches.

Abstract: There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

[182] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng

Main category: cs.CV

TL;DR: Video generation models like Veo-3 show emerging reasoning capabilities but are not yet reliable as standalone zero-shot reasoners, exhibiting strengths in short-horizon spatial coherence but limitations in long-horizon causal reasoning.

Motivation: To investigate whether current video generation models can serve as zero-shot reasoners in challenging visual reasoning scenarios, given their demonstrated capabilities in producing high-fidelity videos and emerging visual perception behaviors.

Method: Conducted empirical evaluation of Veo-3 across 12 reasoning dimensions (spatial, geometric, physical, temporal, embodied logic) using curated MME-CoF benchmark for Chain-of-Frame reasoning assessment.

Result: Video models show promising reasoning in short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, but struggle with long-horizon causal reasoning, strict geometric constraints, and abstract logic.

Conclusion: Current video models are not yet reliable as standalone zero-shot reasoners but show encouraging potential as complementary visual engines alongside dedicated reasoning models.

Abstract: Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io

[183] Two Heads are Better than One: Robust Learning Meets Multi-branch Models

Zongyuan Zhang, Qingwen Bu, Tianyang Duan, Zheng Lin, Yuhao Qing, Zihan Fang, Heming Cui, Dong Huang

Main category: cs.CV

TL;DR: BORT (Branch Orthogonality adversarial Training) is a novel adversarial training method that uses multi-branch neural networks with orthogonal feature spaces to achieve state-of-the-art robustness without additional training data.

Motivation: Most adversarial defense methods focus on data-centric approaches like generating better adversarial examples or using generative models for additional data. This work revisits model architecture itself and explores adversarial robustness from the perspective of deep feature distribution.

Method: Proposes a multi-branch neural network architecture with branch-orthogonal loss to make each branch’s solution space orthogonal, integrating multiple orthogonal solution spaces without increasing inference time.

Result: Achieves 67.3% robust accuracy on CIFAR-10 and 41.5% on CIFAR-100 against l∞ norm-bounded perturbations (ε=8/255), improving state-of-the-art by +7.23% and +9.07% respectively without using additional training data.

Conclusion: BORT demonstrates that focusing on model architecture and feature space orthogonality can achieve superior adversarial robustness compared to data-centric approaches, even outperforming methods that use much larger training datasets.

Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples, in which DNNs are misled to false outputs due to inputs containing imperceptible perturbations. Adversarial training, a reliable and effective method of defense, may significantly reduce the vulnerability of neural networks and has become the de facto standard for robust learning. While many recent works practice the data-centric philosophy, such as how to generate better adversarial examples or use generative models to produce additional training data, we look back to the models themselves and revisit adversarial robustness from the perspective of deep feature distribution as an insightful complementarity. In this paper, we propose Branch Orthogonality adveRsarial Training (BORT) to obtain state-of-the-art performance with solely the original dataset for adversarial training. To practice our design idea of integrating multiple orthogonal solution spaces, we leverage a simple and straightforward multi-branch neural network that eclipses adversarial attacks with no increase in inference time. We heuristically propose a corresponding loss function, branch-orthogonal loss, to make each solution space of the multi-branch model orthogonal. We evaluate our approach on CIFAR-10, CIFAR-100 and SVHN against $\ell_{\infty}$ norm-bounded perturbations of size $\epsilon = 8/255$. Exhaustive experiments are conducted to show that our method goes beyond all state-of-the-art methods without any tricks. Compared to all methods that do not use additional data for training, our models achieve 67.3% and 41.5% robust accuracy on CIFAR-10 and CIFAR-100 (improving upon the state-of-the-art by +7.23% and +9.07%). We also outperform methods using a training set with a far larger scale than ours.
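
One plausible reading of the branch-orthogonal loss is a penalty on cosine alignment between per-branch features; the sketch below uses that assumption on a toy two-branch network, and the paper's exact loss and architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchNet(nn.Module):
    def __init__(self, dim_in=32, dim_feat=16, n_cls=10):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(dim_in, dim_feat), nn.ReLU()) for _ in range(2)
        )
        self.head = nn.Linear(dim_feat, n_cls)

    def forward(self, x):
        feats = [b(x) for b in self.branches]        # per-branch feature spaces
        return self.head(sum(feats) / len(feats)), feats

def branch_orthogonal_loss(feats):
    # Penalize cosine alignment so the two branches occupy
    # (near-)orthogonal solution spaces.
    f0, f1 = (F.normalize(f, dim=-1) for f in feats)
    return (f0 * f1).sum(dim=-1).pow(2).mean()

model = TwoBranchNet()
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
logits, feats = model(x)
loss = F.cross_entropy(logits, y) + 0.1 * branch_orthogonal_loss(feats)
loss.backward()
print(loss.item())
```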

[184] Quality-Aware Prototype Memory for Face Representation Learning

Evgeny Smirnov, Vasiliy Galyuk, Evgeny Lukyanets

Main category: cs.CV

TL;DR: Quality-Aware Prototype Memory improves face recognition by weighting images based on quality during prototype generation, reducing contamination from low-quality faces.

Motivation: Standard Prototype Memory treats all images equally, allowing low-quality or poorly recognizable faces to contaminate prototypes and degrade model performance.

Method: Proposed quality-aware prototype generation that assigns different weights to images based on their quality, giving more importance to high-quality images.

Result: Extensive experiments on face recognition benchmarks show advantages over basic Prototype Memory, with more informative prototypes and better performance.

Conclusion: Quality-aware weighting in prototype generation effectively improves face representation learning by reducing negative impact from low-quality images.

Abstract: Prototype Memory is a powerful model for face representation learning. It enables training face recognition models on datasets of any size by generating prototypes (classifier weights) on the fly and efficiently utilizing them. Prototype Memory demonstrated strong results in many face recognition benchmarks. However, its prototype-generation algorithm is prone to producing imperfectly calculated prototypes when low-quality or poorly recognizable faces appear among the images selected for prototype creation. All images of the same person presented in the mini-batch are used with equal weights, and the resulting averaged prototype can be contaminated by imperfect embeddings of low-quality face images. This may lead to misleading training signals and degrade the performance of the trained models. In this paper, we propose a simple and effective way to improve Prototype Memory with quality-aware prototype generation. Quality-Aware Prototype Memory uses different weights for images of different quality in the process of prototype generation. With this improvement, prototypes receive more informative signals from high-quality images and are less affected by low-quality ones. We propose and compare several methods of quality estimation and usage, perform extensive experiments on different face recognition benchmarks, and demonstrate the advantages of the proposed model compared to the basic version of Prototype Memory.
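
The core change can be sketched as replacing the uniform mini-batch average with a quality-weighted one. In this minimal PyTorch sketch the quality scores and the softmax weighting are placeholder assumptions; the paper compares several estimation and usage schemes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embeddings = F.normalize(torch.randn(5, 512), dim=-1)  # one identity, 5 images
quality = torch.tensor([0.9, 0.8, 0.2, 0.85, 0.1])     # placeholder quality scores

w = torch.softmax(quality / 0.1, dim=0)                # low temperature -> sharp weights
prototype = F.normalize((w[:, None] * embeddings).sum(0), dim=-1)

# The uniform average that the basic Prototype Memory would have used:
uniform = F.normalize(embeddings.mean(0), dim=-1)
print(torch.dot(prototype, uniform).item())            # < 1: the prototypes differ
```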

[185] GSE: Group-wise Sparse and Explainable Adversarial Attacks

Shpresim Sadiku, Moritz Wagner, Sebastian Pokutta

Main category: cs.CV

TL;DR: A two-phase algorithm for crafting group-wise sparse adversarial attacks that achieves high sparsity and attack success rates with improved explainability and faster computation.

Motivation: To address the optimization challenge of crafting group-wise sparse adversarial attacks that are explainable and reveal greater vulnerabilities in DNNs, moving beyond simple ℓ₀ norm regularization to structural sparsity.

Method: Two-phase approach: first optimizes quasinorm adversarial loss using 1/2-quasinorm proximal operator for non-convex programming, then transitions to projected Nesterov’s accelerated gradient descent with ℓ₂-norm regularization on perturbation magnitudes.

Result: Achieved remarkable group-wise sparsity improvements: 50.9% on CIFAR-10 and 38.4% on ImageNet (average case, targeted attack), with 100% attack success rate, faster computation, and improved explainability.

Conclusion: The proposed method successfully generates highly sparse, explainable adversarial attacks that reveal significant vulnerabilities in DNNs while maintaining computational efficiency and high success rates.

Abstract: Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, often regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. We address this by presenting a two-phase algorithm that generates group-wise sparse attacks within semantically meaningful areas of an image. Initially, we optimize a quasinorm adversarial loss using the $\ell_{1/2}$-quasinorm proximal operator tailored for non-convex programming. Subsequently, the algorithm transitions to a projected Nesterov’s accelerated gradient descent with $\ell_2$-norm regularization applied to perturbation magnitudes. Rigorous evaluations on CIFAR-10 and ImageNet datasets demonstrate a remarkable increase in group-wise sparsity, e.g., 50.9% on CIFAR-10 and 38.4% on ImageNet (average case, targeted attack). This performance improvement is accompanied by significantly faster computation times, improved explainability, and a 100% attack success rate.
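
The second phase is the easier one to sketch: projected Nesterov accelerated gradient on the perturbation with an $\ell_2$ penalty, using a toy linear classifier as a stand-in and omitting the $\ell_{1/2}$-quasinorm proximal phase entirely.

```python
import torch

torch.manual_seed(0)
x = torch.rand(3, 8, 8)                  # clean image in [0, 1]
W = torch.randn(10, 3 * 8 * 8)           # toy linear classifier (stand-in)
target = 7                               # class the targeted attack aims for

delta = torch.zeros(3, 8, 8, requires_grad=True)
velocity = torch.zeros_like(x)
lr, momentum, lam = 0.05, 0.9, 0.01

for _ in range(100):
    lookahead = delta + momentum * velocity          # Nesterov lookahead point
    logits = W @ (x + lookahead).flatten()
    # Maximize the target logit; penalize perturbation magnitude (the eps
    # keeps the l2 norm differentiable at zero).
    loss = -logits[target] + lam * (lookahead.pow(2).sum() + 1e-12).sqrt()
    grad, = torch.autograd.grad(loss, delta)
    velocity = momentum * velocity - lr * grad
    with torch.no_grad():
        delta += velocity
        delta.clamp_(-x, 1 - x)                      # project: keep x + delta in [0, 1]

print("perturbation l2 norm:", delta.norm().item())
```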

[186] Dynamic Traceback Learning for Medical Report Generation

Shuchang Ye, Mingyuan Meng, Mingjian Li, Dagan Feng, Usman Naseem, Jinman Kim

Main category: cs.CV

TL;DR: DTrace is a multimodal dynamic traceback learning framework that improves medical report generation by addressing challenges in capturing pathological details and enabling zero-shot inference with only image inputs.

Motivation: Current medical report generation methods struggle with accurately capturing subtle pathological details and require both visual and textual inputs during inference, limiting their practical application in zero-shot scenarios where only images are available.

Method: Proposes DTrace with a traceback mechanism to supervise semantic validity of generated content and a dynamic learning strategy to adapt to various input proportions. Uses cross-modal knowledge enhancement by recovering masked semantic information from complementary counterparts.

Result: Extensive experiments on IU-Xray and MIMIC-CXR datasets show DTrace outperforms state-of-the-art methods for medical report generation.

Conclusion: DTrace effectively addresses key challenges in medical report generation by improving pathological detail capture and enabling robust performance in zero-shot inference scenarios with only image inputs.

Abstract: Automated medical report generation has demonstrated the potential to significantly reduce the workload associated with time-consuming medical reporting. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs during inference, leading to performance degradation in zero-shot inference when only images are available. To address these challenges, this study proposes a novel multimodal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input, enabling text generation without strong reliance on the input from both modalities during inference. The learning of cross-modal knowledge is enhanced by supervising the model to recover masked semantic information from a complementary counterpart. Extensive experiments conducted on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.

[187] VerifIoU – Robustness of Object Detection to Perturbations

Noémie Cohen, Mélanie Ducoffe, Ryma Boumazouza, Christophe Gabreau, Claire Pagetti, Xavier Pucel, Audrey Galametz

Main category: cs.CV

TL;DR: Novel Interval Bound Propagation (IBP) approach for formal verification of object detection models using IoU metric, implemented in open source tool IBP IoU.

Motivation: Need for formal verification of object detection models to ensure accuracy and stability, contributing to more secure and robust machine learning applications.

Method: Interval Bound Propagation (IBP) approach specifically targeting Intersection over Union (IoU) metric, implemented in open source code compatible with abstract interpretation tools.

Result: IBP IoU shows superior performance compared to baseline Vanilla IBP IoU in evaluations on landing approach runway detection and handwritten digit recognition case studies.

Conclusion: IBP IoU approach contributes to more secure and robust machine learning applications through improved formal verification of object detection models.

Abstract: We introduce a novel Interval Bound Propagation (IBP) approach for the formal verification of object detection models, specifically targeting the Intersection over Union (IoU) metric. The approach has been implemented in an open-source tool, named IBP IoU, compatible with popular abstract-interpretation-based verification tools. The resulting verifier is evaluated on landing approach runway detection and handwritten digit recognition case studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the superior performance of IBP IoU in ensuring accuracy and stability, contributing to more secure and robust machine learning applications.
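
A sound (though loose) IoU lower bound follows directly from interval arithmetic on box coordinates, in the spirit of the Vanilla IBP IoU baseline; the paper's IBP IoU derives tighter bounds. A minimal sketch:

```python
def iou_lower_bound(pred_lo, pred_hi, gt):
    """pred_lo/pred_hi: worst-case bounds on (x1, y1, x2, y2); gt: fixed box."""
    x1l, y1l, x2l, y2l = pred_lo
    x1u, y1u, x2u, y2u = pred_hi
    gx1, gy1, gx2, gy2 = gt

    # Smallest possible intersection: x1/y1 at their upper bounds,
    # x2/y2 at their lower bounds.
    iw = max(0.0, min(x2l, gx2) - max(x1u, gx1))
    ih = max(0.0, min(y2l, gy2) - max(y1u, gy1))
    inter_lo = iw * ih

    # Largest possible union: biggest predicted box, smallest intersection.
    pred_area_hi = max(0.0, x2u - x1l) * max(0.0, y2u - y1l)
    gt_area = (gx2 - gx1) * (gy2 - gy1)
    union_hi = pred_area_hi + gt_area - inter_lo
    return inter_lo / union_hi if union_hi > 0 else 0.0

# Nominal box (0, 0, 10, 10) perturbed by at most 1 px per coordinate,
# against an identical ground-truth box.
lo = (-1.0, -1.0, 9.0, 9.0)
hi = (1.0, 1.0, 11.0, 11.0)
print(iou_lower_bound(lo, hi, gt=(0.0, 0.0, 10.0, 10.0)))  # ~0.356
```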

[188] EmoAttack: Emotion-to-Image Diffusion Models for Emotional Backdoor Generation

Tianyu Wei, Shanmin Pang, Qi Guo, Yizhuo Ma, Xiaofeng Cao, Qing Guo

Main category: cs.CV

TL;DR: The paper introduces EmoAttack, a new backdoor attack on text-to-image diffusion models that uses emotional text inputs to trigger generation of malicious negative content, potentially manipulating users’ emotions.

Motivation: To investigate the overlooked risk of using emotion in text inputs to introduce negative content and provoke unfavorable emotions in users through text-to-image diffusion models.

Method: Proposed EmoBooth, which formulates the attack as a diffusion personalization problem by fine-tuning pre-trained diffusion models to establish mapping between emotional word clusters and reference images containing malicious content.

Result: Built a dataset and conducted extensive analysis showing the method’s effectiveness in generating malicious negative content triggered by emotional texts.

Conclusion: Uncovering this threat is critical for society given the widespread use of diffusion models by consumers, highlighting the need for awareness and countermeasures against emotion-aware backdoor attacks.

Abstract: Text-to-image diffusion models can generate realistic images based on textual inputs, enabling users to convey their opinions visually through language. Meanwhile, within language, emotion plays a crucial role in expressing personal opinions in our daily lives, and the inclusion of maliciously negative content can lead users astray, exacerbating negative emotions. Recognizing the success of diffusion models and the significance of emotion, we investigate a previously overlooked risk associated with text-to-image diffusion models, that is, utilizing emotion in the input texts to introduce negative content and provoke unfavorable emotions in users. Specifically, we identify a new backdoor attack, i.e., emotion-aware backdoor attack (EmoAttack), which introduces malicious negative content triggered by emotional texts during image generation. We formulate such an attack as a diffusion personalization problem to avoid extensive model retraining and propose EmoBooth. Unlike existing personalization methods, our approach fine-tunes a pre-trained diffusion model by establishing a mapping between a cluster of emotional words and a given reference image containing malicious negative content. To validate our method, we built a dataset and conducted extensive analysis and discussion of its effectiveness. Given consumers’ widespread use of diffusion models, uncovering this threat is critical for society.

[189] NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods

Jonas Kulhanek, Torsten Sattler

Main category: cs.CV

TL;DR: NerfBaselines is an evaluation framework that addresses reproducibility issues in novel view synthesis by providing consistent benchmarking tools and simplifying method installation.

Motivation: Current Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting methods suffer from inconsistent evaluation protocols, difficult installation, and poor generalization, making it hard to track true state-of-the-art progress.

Method: Proposed NerfBaselines framework with consistent benchmarking tools, reproducibility measures, and simplified installation. Includes a web platform for method comparison on standard benchmarks.

Result: Validated implementation by reproducing original paper results. Showed that tiny differences in evaluation protocols can artificially boost performance, questioning the validity of existing quantitative comparisons.

Conclusion: NerfBaselines ensures comparable quantitative results and truly measures progress in novel view synthesis, providing a valuable contribution to the community.

Abstract: Novel view synthesis is an important problem with many applications, including AR/VR, gaming, and robotic simulations. With the recent rapid development of Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods, it is becoming difficult to keep track of the current state of the art (SoTA) due to methods using different evaluation protocols, codebases being difficult to install and use, and methods not generalizing well to novel 3D scenes. In our experiments, we show that even tiny differences in the evaluation protocols of various methods can artificially boost the performance of these methods. This raises questions about the validity of quantitative comparisons performed in the literature. To address these questions, we propose NerfBaselines, an evaluation framework which provides consistent benchmarking tools, ensures reproducibility, and simplifies the installation and use of various methods. We validate our implementation experimentally by reproducing the numbers reported in the original papers. For improved accessibility, we release a web platform that compares commonly used methods on standard benchmarks. We strongly believe NerfBaselines is a valuable contribution to the community as it ensures that quantitative results are comparable and thus truly measure progress in the field of novel view synthesis.
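
One concrete example of a tiny protocol difference: whether PSNR is computed on float renders or on the quantized 8-bit images written to disk. The sketch below uses synthetic data purely to illustrate the kind of discrepancy NerfBaselines standardizes away.

```python
import numpy as np

rng = np.random.default_rng(0)
gt = rng.random((64, 64, 3)).astype(np.float32)
render = np.clip(gt + rng.normal(0, 0.02, gt.shape).astype(np.float32), 0, 1)

def psnr(a, b):
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(1.0 / mse)

quantized = np.round(render * 255) / 255  # what actually gets saved as a PNG
print(f"float PSNR:     {psnr(render, gt):.3f} dB")
print(f"quantized PSNR: {psnr(quantized, gt):.3f} dB")  # slightly different number
```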

[190] Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data

Yin Chen, Jia Li, Yu Zhang, Zhenzhen Hu, Shiguang Shan, Meng Wang, Richang Hong

Main category: cs.CV

TL;DR: S4D is a dual-modal learning framework that leverages static facial expression recognition (SFER) data to enhance dynamic facial expression recognition (DFER) performance through self-supervised pre-training and multi-task learning with a Mixture of Adapter Experts module.

Motivation: Current DFER methods suffer from limited training data compared to SFER, but static and dynamic expressions are inherently correlated, suggesting that abundant SFER data could improve DFER performance.

Method: Proposes S4D framework with dual-modal self-supervised pre-training using shared ViT encoder-decoder, followed by multi-task fine-tuning with Mixture of Adapter Experts (MoAE) module to prevent negative transfer and enable task-specific knowledge acquisition.

Result: Achieves state-of-the-art performance on FERV39K (53.65% WAR), MAFW (58.44% WAR), and DFEW (76.68% WAR) benchmarks, demonstrating significant improvement in DFER.

Conclusion: Leveraging SFER data through the S4D framework effectively enhances DFER performance, and systematic correlation analysis confirms the benefits of integrating static expression data for dynamic recognition tasks.

Abstract: Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often exhibit unsatisfactory performance, largely due to fewer training samples compared to SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate emotional information interaction. Unfortunately, vanilla multi-task learning in our study results in negative transfer. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting shared knowledge from both static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively. Additionally, a systematic correlation analysis between SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.
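
A Mixture of Adapter Experts layer can be sketched as a router that softly combines small bottleneck adapters on top of a shared ViT block; the dimensions and routing below are simplifying assumptions, not the paper's exact MoAE.

```python
import torch
import torch.nn as nn

class MoAE(nn.Module):
    def __init__(self, dim=768, bottleneck=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):  # x: (batch, tokens, dim) from a shared ViT block
        weights = torch.softmax(self.router(x), dim=-1)            # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], -1)  # (B, T, D, E)
        return x + (expert_out * weights.unsqueeze(-2)).sum(-1)    # residual mixture

out = MoAE()(torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 196, 768])
```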

[191] A Continuous and Interpretable Morphometric for Robust Quantification of Dynamic Biological Shapes

Roua Rouatbi, Juan-Esteban Suarez Cardona, Alba Villaronga-Luque, Jesse V. Veenvliet, Ivo F. Sbalzarini

Main category: cs.CV

TL;DR: The paper introduces PF-SDM, a method for shape quantification in biomedical imaging that encodes geometric and topological properties of closed shapes, providing robust features for shape comparison and machine learning.

Motivation: To develop a compact and interpretable method for quantifying shapes in biomedical imaging that captures geometric and topological properties, including skeleton and symmetries, while being mathematically smooth and extendable to temporal dynamics.

Method: The PF-SDM (Push-Forward Signed Distance Morphometric) method encodes geometric and topological properties of closed shapes, provides access to gradients and differential-geometric quantities, extends to temporal dynamics, and allows fusion of spatial intensity distributions with shape dynamics.

Result: The method was benchmarked on synthetic data and applied to predicting body-axis formation in mouse gastruloids, where it outperformed a CNN baseline in both accuracy and speed.

Conclusion: PF-SDM provides an effective framework for shape quantification in biomedical imaging that offers interpretable features, mathematical smoothness, and superior performance compared to CNN baselines for tasks like body-axis formation prediction.

Abstract: We introduce the Push-Forward Signed Distance Morphometric (PF-SDM) for shape quantification in biomedical imaging. The PF-SDM compactly encodes geometric and topological properties of closed shapes, including their skeleton and symmetries. This provides robust and interpretable features for shape comparison and machine learning. The PF-SDM is mathematically smooth, providing access to gradients and differential-geometric quantities. It also extends to temporal dynamics and allows fusing spatial intensity distributions, such as genetic markers, with shape dynamics. We present the PF-SDM theory, benchmark it on synthetic data, and apply it to predicting body-axis formation in mouse gastruloids, outperforming a CNN baseline in both accuracy and speed.
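
The signed-distance ingredient is easy to illustrate: from a binary mask of a closed shape, combine inside and outside distance transforms into one signed field whose interior ridge traces the skeleton. This SciPy sketch shows only that ingredient, not the push-forward construction itself.

```python
import numpy as np
from scipy import ndimage

# A filled disk as the "shape".
yy, xx = np.mgrid[:128, :128]
mask = (xx - 64) ** 2 + (yy - 64) ** 2 < 40 ** 2

inside = ndimage.distance_transform_edt(mask)    # distance to boundary, inside
outside = ndimage.distance_transform_edt(~mask)  # distance to boundary, outside
sdf = np.where(mask, inside, -outside)           # signed: positive inside

# The ridge of the inside distance approximates the shape's skeleton; for a
# disk it collapses to the center, which is where the maximum sits.
print(np.unravel_index(np.argmax(sdf), sdf.shape))  # (64, 64)
```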

[192] OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models

Mathis Koroglu, Hugo Caselles-Dupré, Guillaume Jeanneret Sanmiguel, Matthieu Cord

Main category: cs.CV

TL;DR: OnlyFlow is a text-to-video generation method that uses optical flow from input videos to control motion, enabling precise motion control without task-specific training.

Motivation: To enable precise motion control in text-to-video generation for applications like camera movement control and video-to-video editing, avoiding reliance on user-defined controls like binary masks or camera embeddings.

Method: Extracts optical flow from input video using an optical flow estimation model, processes it through a trainable optical flow encoder, and injects the feature maps into a text-to-video backbone model.

Result: OnlyFlow performs comparably to state-of-the-art methods across various tasks despite not being specifically trained for them, as shown through quantitative, qualitative and user preference studies.

Conclusion: OnlyFlow provides a versatile, lightweight and efficient method for motion control in text-to-video generation that generalizes well across different tasks.

Abstract: We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on providing user-defined controls, such as binary masks or camera movement embeddings. We propose OnlyFlow, an approach that leverages optical flow extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow compares favorably to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.
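
A minimal sketch of the injection pattern: encode the 2-channel flow field with a small convolutional encoder and add the result to backbone feature maps. The dimensions and the additive injection are assumptions; OnlyFlow conditions a text-to-video diffusion backbone.

```python
import torch
import torch.nn as nn

class FlowEncoder(nn.Module):
    def __init__(self, feat_dim=320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
        )

    def forward(self, flow):     # flow: (batch, 2, H, W) from a flow estimator
        return self.net(flow)    # (batch, feat_dim, H/4, W/4)

flow = torch.randn(1, 2, 64, 64)          # e.g., estimated for one frame pair
backbone_feat = torch.randn(1, 320, 16, 16)
conditioned = backbone_feat + FlowEncoder()(flow)
print(conditioned.shape)  # torch.Size([1, 320, 16, 16])
```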

[193] Resource Efficient Multi-stain Kidney Glomeruli Segmentation via Self-supervision

Zeeshan Nisar, Friedrich Feuerhake, Thomas Lampert

Main category: cs.CV

TL;DR: Self-supervised pre-training enables semantic segmentation with 95% fewer labels while maintaining performance in histopathology image analysis across multiple stains.

Motivation: Address the challenge of semantic segmentation under domain shift in histopathology, where obtaining labeled data across different imaging conditions (stains) is costly and time-consuming.

Method: Use self-supervised pre-training methods (SimCLR, BYOL, and novel HR-CS-CO) to enhance segmentation models (UNet and UDAGAN) with minimal labeled data.

Result: With only 5% labels and self-supervised pre-training, performance drops are minimal: 5.9% for UNet and 6.2% for UDAGAN compared to fully supervised counterparts.

Conclusion: Self-supervised pre-training significantly reduces label dependency for semantic segmentation in histopathology, maintaining performance with 95% fewer labels and generalizing to benchmark datasets.

Abstract: Semantic segmentation under domain shift remains a fundamental challenge in computer vision, particularly when labelled training data is scarce. This challenge is exemplified in histopathology image analysis, where the same tissue structures must be segmented across images captured under different imaging conditions (stains), each representing a distinct visual domain. Traditional deep learning methods like UNet require extensive labels, which is both costly and time-consuming, particularly when dealing with multiple domains (or stains). To mitigate this, various unsupervised domain adaptation-based methods such as UDAGAN have been proposed, which reduce the need for labels by requiring only one (source) stain to be labelled. Nonetheless, obtaining source stain labels can still be challenging. This article shows that through self-supervised pre-training – including SimCLR, BYOL, and a novel approach, HR-CS-CO – the performance of these segmentation methods (UNet, and UDAGAN) can be retained even with 95% fewer labels. Notably, with self-supervised pre-training and using only 5% labels, the performance drops are minimal: 5.9% for UNet and 6.2% for UDAGAN, averaged over all stains, compared to their respective fully supervised counterparts (without pre-training, using 100% labels). Furthermore, these findings are shown to generalise beyond their training distribution to public benchmark datasets. Implementations and pre-trained models are publicly available online at https://github.com/zeeshannisar/resource-effecient-multi-stain-kidney-glomeruli-segmentation.git.
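
Of the pre-training methods compared, SimCLR's NT-Xent loss is the easiest to sketch: each sample's positive is the other augmented view of the same input, and all remaining samples act as negatives (HR-CS-CO, the paper's own method, is not shown here).

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (batch, dim) projections of two augmented views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # (2B, D)
    sim = z @ z.t() / temperature                         # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                     # exclude self-pairs
    n = z1.shape[0]
    # The positive for row i is the other view of the same sample: i +/- n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

print(nt_xent(torch.randn(8, 128), torch.randn(8, 128)).item())
```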

[194] Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning

Zhifang Zhang, Shuo He, Haobo Wang, Bingquan Shen, Lei Feng

Main category: cs.CV

TL;DR: RVPT is a novel defense method that uses deep visual prompt tuning with feature-repelling loss to enhance CLIP’s resistance to backdoor attacks by repelling non-predictive features while maintaining only essential predictive patterns.

Motivation: CLIP models are vulnerable to backdoor attacks due to encoding features beyond in-dataset predictive patterns, compromising visual feature resistivity to input perturbations.

Method: Repulsive Visual Prompt Tuning (RVPT) employs deep visual prompt tuning with feature-repelling loss that adversarially repels encoded features from deeper layers while optimizing cross-entropy loss, using few-shot clean samples and tuning only 0.27% of parameters.

Result: RVPT reduces attack success rate from 89.70% to 2.76% against advanced multimodal attacks on ImageNet, generalizes across multiple datasets, and significantly outperforms state-of-the-art defense methods.

Conclusion: RVPT effectively enhances CLIP’s visual feature resistivity against backdoor attacks by focusing on predictive features only, providing efficient defense with minimal parameter tuning and clean data requirements.

Abstract: Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets, while they exhibit significant vulnerabilities to backdoor attacks, raising serious safety concerns. In this paper, we reveal that CLIP’s vulnerabilities primarily stem from its tendency to encode features beyond in-dataset predictive patterns, compromising its visual feature resistivity to input perturbations. This makes its encoded features highly susceptible to being reshaped by backdoor triggers. To address this challenge, we propose Repulsive Visual Prompt Tuning (RVPT), a novel defense approach that employs deep visual prompt tuning with a specially designed feature-repelling loss. Specifically, RVPT adversarially repels the encoded features from deeper layers while optimizing the standard cross-entropy loss, ensuring that only predictive features in downstream tasks are encoded, thereby enhancing CLIP’s visual feature resistivity against input perturbations and mitigating its susceptibility to backdoor attacks. Unlike existing multimodal backdoor defense methods that typically require the availability of poisoned data or involve fine-tuning the entire model, RVPT leverages few-shot downstream clean samples and only tunes a small number of parameters. Empirical results demonstrate that RVPT tunes only 0.27% of the parameters in CLIP, yet it significantly outperforms state-of-the-art defense methods, reducing the attack success rate from 89.70% to 2.76% against the most advanced multimodal attacks on ImageNet and effectively generalizes its defensive capabilities across multiple datasets.
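
A simplified reading of the objective: keep the classification loss while pushing the prompted features away from the frozen encoder's original features. The cosine-similarity form of the repelling term below is an assumption, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def rvpt_loss(logits, labels, feat_prompted, feat_frozen, lam=1.0):
    ce = F.cross_entropy(logits, labels)
    # Similarity between tuned and original (frozen) intermediate features;
    # minimizing it repels non-predictive feature content.
    repel = F.cosine_similarity(feat_prompted, feat_frozen, dim=-1).mean()
    return ce + lam * repel

logits = torch.randn(8, 10, requires_grad=True)
feat_p = torch.randn(8, 512, requires_grad=True)   # from the prompted encoder
feat_f = torch.randn(8, 512)                       # from the frozen encoder
loss = rvpt_loss(logits, torch.randint(0, 10, (8,)), feat_p, feat_f)
loss.backward()
print(loss.item())
```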

[195] UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping

Yanjie Li, Kaisheng Liang, Bin Xiao

Main category: cs.CV

TL;DR: UV-Attack is a novel adversarial attack method that uses dynamic-NeRF-based UV mapping to generate adversarial textures for person detectors, achieving high success rates across diverse human poses and viewpoints by modifying UV maps instead of RGB images.

Motivation: Previous adversarial attacks on person detectors using patches or static 3D models had low success rates due to the flexible nature of human movement and challenges in modeling 3D deformations caused by various actions.

Method: Leverages dynamic-NeRF-based UV mapping to generate human images across diverse actions and viewpoints, creates UV maps instead of RGB images, modifies texture stacks, and uses Expectation over Pose Transformation loss (EoPT) to improve evasion on unseen poses and views.

Result: Achieves 92.7% attack success rate against FastRCNN across varied poses in dynamic video settings (significantly outperforming AdvCamou’s 28.5% ASR), and 49.5% ASR on YOLOv8 in black-box settings.

Conclusion: UV-Attack demonstrates the potential of dynamic NeRF-based UV mapping for creating more effective adversarial attacks on person detectors, successfully addressing challenges in modeling human movement and texture modification.

Abstract: In recent research, adversarial attacks on person detectors using patches or static 3D model-based texture modifications have struggled with low success rates due to the flexible nature of human movement. Modeling the 3D deformations caused by various actions has been a major challenge. Fortunately, advancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer new possibilities. In this paper, we introduce UV-Attack, a groundbreaking approach that achieves high success rates even with extensive and unseen human actions. We address the challenge above by leveraging dynamic-NeRF-based UV mapping. UV-Attack can generate human images across diverse actions and viewpoints, and even create novel actions by sampling from the SMPL parameter space. While dynamic NeRF models are capable of modeling human bodies, modifying clothing textures is challenging because they are embedded in neural network parameters. To tackle this, UV-Attack generates UV maps instead of RGB images and modifies the texture stacks. This approach enables real-time texture edits and makes the attack more practical. We also propose a novel Expectation over Pose Transformation loss (EoPT) to improve the evasion success rate on unseen poses and views. Our experiments show that UV-Attack achieves a 92.7% attack success rate against the FastRCNN model across varied poses in dynamic video settings, significantly outperforming the state-of-the-art AdvCamou attack, which only had a 28.5% ASR. Moreover, we achieve 49.5% ASR on the latest YOLOv8 detector in black-box settings. This work highlights the potential of dynamic NeRF-based UV mapping for creating more effective adversarial attacks on person detectors, addressing key challenges in modeling human movement and texture modification. The code is available at https://github.com/PolyLiYJ/UV-Attack.
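
The EoPT idea can be sketched as an expectation of detector confidence over sampled poses and views; `render` and `person_score` below are hypothetical stand-ins for the paper's dynamic-NeRF renderer and detector head.

```python
import torch

def eopt_loss(texture, render, person_score, n_samples=8):
    total = 0.0
    for _ in range(n_samples):
        pose = torch.randn(72)       # e.g., a sample from the SMPL pose space
        view = torch.rand(2) * 360   # random azimuth/elevation, in degrees
        image = render(texture, pose, view)
        total = total + person_score(image)  # confidence the detector assigns
    return total / n_samples         # minimize: evade detection in expectation

# Toy stand-ins so the sketch runs end to end.
texture = torch.randn(3, 32, 32, requires_grad=True)
render = lambda tex, pose, view: tex.mean() + 0.01 * pose.sum() + 0.001 * view.sum()
person_score = lambda image: torch.sigmoid(image)

loss = eopt_loss(texture, render, person_score)
loss.backward()  # the gradient w.r.t. the UV texture drives the attack
print(texture.grad.abs().mean().item())
```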

[196] GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu

Main category: cs.CV

TL;DR: GameFactory is a framework for action-controlled scene-generalizable game video generation that enables precise control over keyboard/mouse inputs and supports unlimited-length interactive videos through autoregressive generation.

Motivation: To revolutionize game development by autonomously creating new content through generative videos, addressing the challenges of action controllability and scene generalization that existing methods fail to solve.

Method: Uses GF-Minecraft action-annotated dataset, action control module for precise input control, multi-phase training with domain adapter to decouple game style learning from action control, and leverages pre-trained video diffusion models’ open-domain priors.

Result: Effectively generates open-domain action-controllable game videos, demonstrating scene-generalizable action control beyond fixed styles and scenes.

Conclusion: GameFactory represents a significant step forward in AI-driven game generation by achieving scene-generalizable action control and enabling creation of entirely new and diverse games.

Abstract: Generative videos have the potential to revolutionize game development by autonomously creating new content. In this paper, we present GameFactory, a framework for action-controlled scene-generalizable game video generation. We first address the fundamental challenge of action controllability by introducing GF-Minecraft, an action-annotated game video dataset without human bias, and developing an action control module that enables precise control over both keyboard and mouse inputs. We further extend to support autoregressive generation for unlimited-length interactive videos. More importantly, GameFactory tackles the critical challenge of scene-generalizable action control, which most existing methods fail to address. To enable the creation of entirely new and diverse games beyond fixed styles and scenes, we leverage the open-domain generative priors from pre-trained video diffusion models. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy with a domain adapter that decouples game style learning from action control. This decoupling ensures that action control learning is no longer bound to specific game styles, thereby achieving scene-generalizable action control. Experimental results demonstrate that GameFactory effectively generates open-domain action-controllable game videos, representing a significant step forward in AI-driven game generation.

[197] CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data

Disheng Liu, Yiran Qiao, Wuche Liu, Yiren Lu, Yunlai Zhou, Tuo Liang, Yu Yin, Jing Ma

Main category: cs.CV

TL;DR: Causal3D is a new benchmark that combines structured data and visual representations to evaluate causal reasoning abilities in AI models, featuring 19 3D-scene datasets with diverse causal relations.

DetailsMotivation: There's a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data, despite true intelligence requiring the ability to uncover and leverage hidden causal relations.

Method: Created a comprehensive benchmark with 19 3D-scene datasets integrating structured data (tables) with corresponding visual representations (images), designed within a systematic framework to capture diverse causal relations, views, and backgrounds.

Result: Experiments showed that as causal structures grow more complex without prior knowledge, performance declines significantly across multiple state-of-the-art methods including classical causal discovery, causal representation learning, and LLMs/VLMs.

Conclusion: Causal3D serves as a vital resource for advancing causal reasoning in computer vision and fostering trustworthy AI in critical domains, highlighting the challenges even advanced methods face in complex causal scenarios.

Abstract: True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models’ abilities to infer latent causality from complex visual data. In this paper, we introduce Causal3D, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.

[198] MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?

Zhe Xu, Daoyuan Chen, Zhenqing Ling, Yaliang Li, Ying Shen

Main category: cs.CV

TL;DR: MindGYM is a structured framework for generating high-quality synthetic data that enhances foundation models’ thinking abilities through self-generated, cognitively guided data synthesis.

DetailsMotivation: Large foundation models struggle with acquiring transferable structured thinking abilities when trained on rigid templates or crowd-annotated datasets, requiring a more thinking-centric approach.

Method: Three-step framework: (1) Cognitive Thinking Process Injection to shape synthesis behavior, (2) Seed Single-Hop Question Synthesis for atomic questions, (3) Challenging Multi-Hop QA Synthesis for complex reasoning questions.
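
A minimal sketch of how the three-step synthesis loop could be wired together, with an interchangeable `llm` callable standing in for the model; the prompts are illustrative, not the paper's actual templates.

```python
from typing import Callable, List

def mindgym_synthesize(llm: Callable[[str], str], topics: List[str], n_hops: int = 2) -> List[str]:
    """Sketch of the three-step pipeline: (1) inject a cognitive thinking
    objective into every prompt, (2) generate atomic single-hop seed
    questions, (3) compose seeds into a harder multi-hop question."""
    cognitive_prefix = (
        "Think step by step and aim for questions that require structured reasoning.\n"
    )
    # Step 2: seed single-hop questions, one per semantic topic
    seeds = [llm(cognitive_prefix + f"Write one atomic question about: {t}") for t in topics]
    # Step 3: compose seeds into a challenging multi-hop question
    joined = "\n".join(seeds[:n_hops])
    multi_hop = llm(cognitive_prefix + "Combine these into a single multi-hop question:\n" + joined)
    return seeds + [multi_hop]

# Toy run with an echo function standing in for a real LLM:
print(mindgym_synthesize(lambda p: "Q: " + p.splitlines()[-1], ["ratios", "geometry"]))
```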

Result: Synthetic data achieves 16.7% higher quality and 67.91% lower variance than baselines. Improves performance on six reasoning benchmarks with up to 16% gain on MathVision using only 400 samples, with generalizable improvements across model sizes.

Conclusion: MindGYM demonstrates the viability of self-challenging mechanisms for refining large model capabilities while minimizing human intervention, promoting data-centric research for self-evolving foundation models.

Abstract: Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model’s synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.

[199] Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space

Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, Xiao-Ping Zhang

Main category: cs.CV

TL;DR: Open3D-VQA is a novel benchmark for evaluating multimodal large language models’ spatial reasoning capabilities in aerial environments, featuring 73k QA pairs across 7 tasks and revealing key insights about model performance.

DetailsMotivation: Spatial reasoning is fundamental for MLLMs but their performance in open aerial environments remains underexplored, creating a need for comprehensive evaluation benchmarks.

Method: Created a benchmark with 73k QA pairs spanning 7 spatial reasoning tasks, automatically generated from spatial relations extracted from real-world and simulated aerial scenes, supporting both visual and point cloud modalities.
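
The automatic QA generation step can be illustrated with a small template sketch over extracted relation tuples; the `(object_a, relation, object_b, distance_m)` schema and the question wordings are assumptions for illustration, not the benchmark's actual pipeline.

```python
# Minimal sketch of template-based QA generation from spatial relations.
relations = [("car", "left of", "building", 12.4), ("drone", "above", "bridge", 30.0)]

qa_pairs = []
for a, rel, b, dist in relations:
    # Relative-relation question (true/false style)
    qa_pairs.append((f"Is the {a} {rel} the {b}?", "yes"))
    # Absolute-distance question (short-answer style)
    qa_pairs.append((f"How far is the {a} from the {b} in meters?", f"{dist:.1f}"))

for q, ans in qa_pairs:
    print(q, "->", ans)
```

The observed gap between relative-relation and absolute-distance accuracy maps directly onto these two question families.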

Result: Evaluation of 13 MLLMs showed: 1) Better performance on relative spatial relations than absolute distances, 2) 3D LLMs don’t significantly outperform 2D LLMs, 3) Fine-tuning on simulated data improves real-world performance.

Conclusion: The benchmark reveals important limitations in current MLLMs’ spatial reasoning and provides tools for future research, with the dataset and evaluation toolkit publicly released.

Abstract: Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs’ ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model’s spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.

[200] Language-guided Open-world Video Anomaly Detection under Weak Supervision

Zihao Liu, Xiaoyu Wu, Jianqin Wu, Xuxu Wang, Linlin Yang

Main category: cs.CV

TL;DR: LaGoVAD introduces an open-world video anomaly detection paradigm that adapts to variable anomaly definitions via natural language guidance, achieving state-of-the-art performance through dynamic video synthesis and contrastive learning.

DetailsMotivation: Existing video anomaly detection methods assume fixed anomaly definitions, making them unsuitable for open-world scenarios where definitions change based on context (e.g., mask-wearing during flu outbreaks).

Method: Proposes LaGoVAD with two regularization strategies: diversifying anomaly durations via dynamic video synthesis and enhancing feature robustness through contrastive learning with negative mining. Also introduces PreVAD dataset for training.
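
A minimal PyTorch sketch of the contrastive-learning-with-negative-mining ingredient: each video embedding is pulled toward its matched anomaly definition and pushed from the hardest non-matching definitions. The loss form and hyperparameters are illustrative, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_negative_mining(video_emb, text_emb, k_hard=4, tau=0.07):
    """Align video features with textual anomaly definitions; keep only the
    k most similar non-matching definitions as (hard) negatives."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = video_emb @ text_emb.T / tau                     # (B, B) similarity matrix
    B = sim.size(0)
    pos = sim.diag()                                       # matched pairs on the diagonal
    neg = sim.masked_fill(torch.eye(B, dtype=torch.bool), float("-inf"))
    hard_neg, _ = neg.topk(k_hard, dim=1)                  # mine the hardest negatives
    logits = torch.cat([pos.unsqueeze(1), hard_neg], dim=1)
    labels = torch.zeros(B, dtype=torch.long)              # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = contrastive_loss_with_negative_mining(torch.randn(8, 256), torch.randn(8, 256))
```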

Result: Zero-shot experiments on seven datasets demonstrate state-of-the-art performance, validating the effectiveness of the language-guided approach.

Conclusion: LaGoVAD successfully enables adaptable video anomaly detection in open-world scenarios through natural language guidance, with the PreVAD dataset providing crucial training resources for this new paradigm.

Abstract: Video anomaly detection (VAD) aims to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask may be considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly scores. Therefore, we propose LaGoVAD (Language-guided Open-world Video Anomaly Detector), a model that dynamically adapts anomaly definitions under weak supervision with two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate LaGoVAD’s SOTA performance. Our dataset and code will be released at https://github.com/Kamino666/LaGoVAD-PreVAD.

[201] LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

Pingping Zhang, Xiang Hu, Yuhao Wang, Huchuan Lu

Main category: cs.CV

TL;DR: LATex is a novel framework for Aerial-Ground person Re-ID that uses prompt-tuning with CLIP to leverage attribute-based text knowledge, addressing limitations of previous methods that overlook semantic attributes and require expensive full fine-tuning.

DetailsMotivation: Previous AG-ReID methods focus on view-invariant features but overlook semantic attribute information, and existing training strategies rely on costly full fine-tuning of large models.

Method: Uses CLIP model with Attribute-aware Image Encoder (AIE) to extract global semantic and attribute-aware features, Prompted Attribute Classifier Group (PACG) to predict attributes, and Coupled Prompt Template (CPT) to transform attributes and view info into structured sentences for text encoder.
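
A small sketch of the coupled-template idea: predicted attributes and the camera view are serialized into a structured sentence for the text encoder. The wording and attribute set below are illustrative assumptions; the paper's CPT may differ.

```python
# Sketch of building a structured prompt from predicted attributes and view info.
def build_prompt(view: str, attributes: dict) -> str:
    attr_phrases = [f"{k.replace('_', ' ')}: {v}" for k, v in attributes.items() if v]
    return (f"A photo of a person captured from a {view} viewpoint, "
            f"with attributes {', '.join(attr_phrases)}.")

prompt = build_prompt("aerial", {"gender": "female",
                                 "upper_clothing": "red jacket",
                                 "carrying": "backpack"})
print(prompt)
# The sentence would then be tokenized and passed through CLIP's text encoder
# to obtain a more discriminative text feature for retrieval.
```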

Result: Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of the proposed methods in improving AG-ReID performance.

Conclusion: The LATex framework successfully leverages attribute-based text knowledge through prompt-tuning strategies to enhance AG-ReID performance while reducing training costs compared to full fine-tuning approaches.

Abstract: As an important task in intelligent transportation systems, Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different viewpoints. Previous methods typically adopt deep learning-based models, focusing on extracting view-invariant features. However, they usually overlook the semantic information in person attributes. In addition, existing training strategies often rely on full fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. Specifically, with the Contrastive Language-Image Pre-training (CLIP) model, we first propose an Attribute-aware Image Encoder (AIE) to extract both global semantic features and attribute-aware features from input images. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to predict person attributes and obtain attribute representations. Finally, we design a Coupled Prompt Template (CPT) to transform attribute representations and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed methods. The source code is available at https://github.com/kevinhu314/LATex.

[202] SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Yuhao Wang, Xiang Hu, Lixin Wang, Pingping Zhang, Huchuan Lu

Main category: cs.CV

TL;DR: SD-ReID is a generative framework for aerial-ground person re-identification that uses Stable Diffusion to enhance person representations by mimicking feature distributions across different viewpoints, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: Previous AG-ReID methods focus on view-robust models but overlook the importance of view-specific features. Designing view-robust models is challenging, and existing approaches don't fully leverage view-specific information to enhance person representation.

Method: Proposes SD-ReID framework: 1) Train ViT-based model to extract person representations with controllable identity and view conditions, 2) Fine-tune Stable Diffusion to enhance representations guided by these conditions, 3) Introduce View-Refined Decoder to bridge instance-level and global-level features, 4) Use both person representations and all-view features for retrieval.

Result: Extensive experiments on five AG-ReID benchmarks (CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR, G2APS-ReID) demonstrate the effectiveness of the proposed method, achieving superior performance compared to previous approaches.

Conclusion: The SD-ReID framework successfully leverages generative models to enhance person representations in aerial-ground re-identification by incorporating both identity consistency and view-specific features, providing a novel solution to the challenging cross-view retrieval problem.

Abstract: Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model’s ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code will be available.

[203] Empowering Agentic Video Analytics Systems with Video Language Models

Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu

Main category: cs.CV

TL;DR: AVA is a VLM-powered system for open-ended video analytics that addresses context window limitations through Event Knowledge Graphs and agentic retrieval-generation, achieving SOTA performance on benchmarks.

DetailsMotivation: Existing video analytics systems are limited to predefined tasks and struggle with ultra-long videos due to VLM context window constraints, requiring a more adaptable solution.

Method: Uses Event Knowledge Graphs for efficient indexing of long videos and agentic retrieval-generation mechanism to handle complex queries by leveraging the EKGs.
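
A toy sketch of the event-knowledge-graph-plus-retrieval flow: events extracted from the stream are indexed with timestamps and entities, and queries retrieve the most relevant events for the agent to reason over. The data model and the overlap-based scorer are illustrative simplifications of AVA's actual EKG construction.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    t_start: float          # seconds into the stream
    t_end: float
    caption: str            # short VLM-produced description
    entities: set = field(default_factory=set)

events = [
    Event(10, 25, "a person enters the lobby", {"person", "lobby"}),
    Event(300, 330, "the same person leaves a bag near the door", {"person", "bag", "door"}),
]

def retrieve(query_entities, events, top_k=5):
    """Score events by entity overlap with the query; an agent would iterate
    retrieve -> reason -> refine over these hits for complex questions."""
    scored = sorted(events, key=lambda e: -len(e.entities & query_entities))
    return scored[:top_k]

hits = retrieve({"bag", "person"}, events)
print([e.caption for e in hits])
```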

Result: Achieved 62.3% on LVBench, 64.1% on VideoMME-Long, and 75.8% on new AVA-100 benchmark, significantly outperforming existing VLM and video RAG systems.

Conclusion: AVA effectively enables open-ended video analytics for ultra-long videos through its innovative EKG and retrieval-generation approach, demonstrating superior performance across multiple benchmarks.

Abstract: AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.

[204] Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum

Wenquan Lu, Jiaqi Zhang, Hugues Van Assel, Randall Balestriero

Main category: cs.CV

TL;DR: A self-supervised learning framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning, using a denoised-to-noisy curriculum and teacher-guided regularization.

DetailsMotivation: SSL research has focused on clean datasets, but applying SSL to noisy data is crucial for fields like astrophysics, medical imaging, geophysics, and finance. Current methods struggle with noisy data.

Method: Trains an SSL denoiser on noisy data, then constructs a denoised-to-noisy curriculum for pretraining SSL backbone (e.g., DINOv2) combined with teacher-guided regularization that anchors noisy embeddings to denoised counterparts.
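
A minimal PyTorch sketch of the two ingredients, under stated assumptions: a teacher-guided regularizer that anchors noisy embeddings to their (stop-gradient) denoised counterparts, and a hard-switch curriculum; the cosine-distance form and hard switch are illustrative choices, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def teacher_guided_reg(student, teacher, noisy_batch, denoised_batch):
    """Anchor embeddings of noisy images to the stop-gradient embeddings of
    their denoised counterparts (cosine-distance anchor)."""
    with torch.no_grad():
        target = F.normalize(teacher(denoised_batch), dim=-1)
    pred = F.normalize(student(noisy_batch), dim=-1)
    return (1 - (pred * target).sum(dim=-1)).mean()

def curriculum_batch(denoised, noisy, step, switch_step):
    """Denoised-to-noisy curriculum: train on denoised samples first, then
    switch to raw noisy samples (a gradual mix is equally plausible)."""
    return denoised if step < switch_step else noisy

# Toy usage with a stand-in encoder:
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
x_noisy, x_denoised = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
loss = teacher_guided_reg(encoder, encoder, x_noisy, x_denoised)
```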

Result: On ImageNet-1k with ViT-B under extreme Gaussian noise (σ=255, SNR=0.72 dB), the method improves linear probing accuracy by 4.8% over DINOv2.

Conclusion: Denoiser-free robustness can emerge from noise-aware pretraining, and the denoiser can be discarded after pretraining, simplifying deployment.

Abstract: Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining an SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.

[205] DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang

Main category: cs.CV

TL;DR: DOVE is an efficient one-step diffusion model for real-world video super-resolution that achieves comparable performance to multi-step methods while providing 28x speed-up.

DetailsMotivation: Diffusion models show promise in video super-resolution but are extremely slow due to requiring dozens of sampling steps. Single-step sampling could solve this but remains challenging due to high training overhead and fidelity demands.

Method: Fine-tune pretrained CogVideoX model using latent-pixel training strategy with two-stage adaptation to video super-resolution. Construct HQ-VSR dataset with specialized video processing pipeline for enhanced training.

Result: DOVE achieves comparable or superior performance to multi-step diffusion-based VSR methods while offering outstanding inference efficiency with up to 28x speed-up over existing methods like MGLD-VSR.

Conclusion: DOVE demonstrates that efficient one-step diffusion models can achieve high-quality video super-resolution with significant speed improvements, making real-world VSR more practical.

Abstract: Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28$\times$ speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.

[206] Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin

Main category: cs.CV

TL;DR: DiTF is a training-free framework that extracts semantic-discriminative features from Diffusion Transformers by addressing massive activations through AdaLN-zero localization and channel discard strategies.

DetailsMotivation: Diffusion Transformers (DiTs) suffer from massive activations where few feature dimensions dominate, leading to uninformative representations and performance degradation in dense correspondence tasks.

Method: Proposed DiTF framework uses AdaLN to localize and normalize massive activations with channel-wise modulation, plus a channel discard strategy to eliminate negative impacts from these activations.
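
A minimal sketch of the channel-level idea: spot the few channels whose magnitude dwarfs the rest (shared across all patch tokens) and either zero them (channel discard) or rescale them. The z-score detector below is a stand-in for the paper's AdaLN-based localization, and the threshold is an assumption.

```python
import torch

def suppress_massive_activations(feats, z_thresh=6.0, discard=True):
    """feats: (tokens, channels). Find channels with extreme average
    magnitude and either zero them out or rescale to the typical scale."""
    mag = feats.abs().mean(dim=0)                        # per-channel magnitude
    z = (mag - mag.mean()) / (mag.std() + 1e-6)          # how extreme each channel is
    massive = z > z_thresh
    out = feats.clone()
    if discard:
        out[:, massive] = 0.0                            # channel discard strategy
    else:
        out[:, massive] *= (mag.mean() / (mag[massive] + 1e-6))
    return out, massive

feats = torch.randn(256, 1024)
feats[:, 7] += 80.0                                      # inject one dominating channel
cleaned, which = suppress_massive_activations(feats)
print(which.nonzero().flatten())                         # tensor([7])
```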

Result: DiTF outperforms both DINO and SD-based models, achieving +9.4% on Spair-71k and +4.4% on AP-10K-C.S., establishing new SOTA for DiTs in visual correspondence.

Conclusion: The proposed DiTF framework effectively addresses massive activations in DiTs and enables superior feature extraction for dense correspondence tasks without requiring additional training.

Abstract: Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., with +9.4% on Spair-71k and +4.4% on AP-10K-C.S.).

[207] StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations

Yanjie Li, Wenxuan Zhang, Xinqi Lyu, Yihao Liu, Bin Xiao

Main category: cs.CV

TL;DR: StyleGuard is a novel anti-mimicry method that protects images from style mimicry attacks by optimizing style-related features in latent space and incorporating ensemble purification during training, achieving superior robustness against various transformations and purification methods.

DetailsMotivation: Address concerns about intellectual property protection and deceptive content generation from text-to-image diffusion models, while overcoming limitations of existing defenses that are vulnerable to purification attacks and lack transferability across different models.

Method: Proposes a novel style loss to optimize style-related features in latent space for better transferability, and an upscale loss that involves ensemble purifiers and upscalers during training to bypass diffusion-based purification.

Result: Extensive experiments on WikiArt and CelebA datasets show StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models including DreamBooth and Textual Inversion.

Conclusion: StyleGuard provides an effective defense against style mimicry attacks with improved transferability and robustness against purification methods, offering better protection for intellectual property in text-to-image generation.

Abstract: Recently, text-to-image diffusion models have been widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion. This has raised concerns about intellectual property protection and the generation of deceptive content. Recent studies, such as Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect images from these attacks. However, recent purification-based methods, such as DiffPure and Noise Upscaling, have successfully attacked these latest defenses, showing the vulnerabilities of these methods. Moreover, present methods show limited transferability across models, making them less effective against unknown text-to-image models. To address these issues, we propose a novel anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes the style-related features in the latent space to make it deviate from the original image, which improves model-agnostic transferability. Additionally, to enhance the perturbation’s ability to bypass diffusion-based purification, we designed a novel upscale loss that involves ensemble purifiers and upscalers during training. Extensive experiments on the WikiArt and CelebA datasets demonstrate that StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models. Moreover, StyleGuard is effective on different style mimicry methods, including DreamBooth and Textual Inversion. The code is available at https://github.com/PolyLiYJ/StyleGuard.

[208] Learning World Models for Interactive Video Generation

Taiye Chen, Xun Hu, Zihan Ding, Chi Jin

Main category: cs.CV

TL;DR: The paper proposes VRAG (video retrieval augmented generation) with explicit global state conditioning to address compounding errors and improve spatiotemporal consistency in interactive world models for video generation.

DetailsMotivation: Current long video generation models have limited world modeling capabilities due to compounding errors and insufficient memory mechanisms, which affects their ability to maintain spatiotemporal coherence for effective future planning with actions.

Method: Enhanced image-to-video models with action conditioning and autoregressive framework, then proposed VRAG with explicit global state conditioning to reduce compounding errors and improve consistency.

Result: VRAG significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models, while naive autoregressive generation with extended context windows and retrieval-augmented generation proved less effective.

Conclusion: The work illuminates fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

[209] LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

Jonas Kulhanek, Marie-Julie Rakotosaona, Fabian Manhardt, Christina Tsalicoglou, Michael Niemeyer, Torsten Sattler, Songyou Peng, Federico Tombari

Main category: cs.CV

TL;DR: A hierarchical LOD method for 3D Gaussian Splatting that enables real-time rendering of large scenes on memory-constrained devices by selecting optimal Gaussian subsets based on camera distance.

DetailsMotivation: To enable real-time rendering of large-scale 3D scenes on memory-constrained devices by reducing both rendering time and GPU memory usage while maintaining visual quality.

Method: Hierarchical LOD representation with depth-aware 3D smoothing, importance-based pruning, fine-tuning, spatial chunking with dynamic loading, and opacity-blending for boundary artifacts.
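
A NumPy sketch of the core selection logic: distance bands around the camera are served by progressively coarser, importance-pruned subsets of Gaussians. The band thresholds, keep fractions, and importance field are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.uniform(-100, 100, size=(100_000, 3)).astype(np.float32)
importance = rng.random(100_000).astype(np.float32)      # e.g., from pruning scores

lod_keep_fraction = {0: 1.0, 1: 0.25, 2: 0.05}           # finest -> coarsest
lod_distance = [(0, 20.0), (1, 60.0), (2, np.inf)]       # which LOD serves which band

def select_lod(camera_pos):
    """Keep the most important Gaussians per distance band, with coarser
    (smaller) subsets used for bands farther from the camera."""
    dist = np.linalg.norm(positions - camera_pos, axis=1)
    keep = np.zeros(len(positions), dtype=bool)
    prev = 0.0
    for lod, max_d in lod_distance:
        band = (dist >= prev) & (dist < max_d)
        cutoff = np.quantile(importance, 1 - lod_keep_fraction[lod])
        keep |= band & (importance >= cutoff)
        prev = max_d
    return keep

mask = select_lod(np.zeros(3, dtype=np.float32))
print(mask.sum(), "of", len(positions), "Gaussians selected")
```

Spatial chunking would then load only the chunks intersecting the current view, with opacity blending hiding chunk boundaries.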

Result: Achieves state-of-the-art performance on outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets with high-quality renderings, reduced latency, and lower memory requirements.

Conclusion: The proposed method successfully enables efficient real-time rendering of large-scale 3D scenes on resource-constrained devices while preserving visual fidelity through hierarchical LOD optimization.

Abstract: In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.

[210] Towards Predicting Any Human Trajectory In Context

Ryo Fujii, Hideo Saito, Ryo Hachiuma

Main category: cs.CV

TL;DR: TrajICL is an In-Context Learning framework for pedestrian trajectory prediction that enables adaptation without fine-tuning, using spatio-temporal similarity-based example selection and prediction-guided selection methods trained on large-scale synthetic data.

DetailsMotivation: Current pedestrian trajectory prediction methods require fine-tuning for each new scenario, which is impractical for edge device deployment. There's a need for adaptable models that can work across different environments without weight updates.

Method: Uses In-Context Learning with spatio-temporal similarity-based example selection (STES) to find relevant motion patterns, and prediction-guided example selection (PG-ES) that considers both past and predicted future trajectories. Trained on large-scale synthetic dataset for enhanced prediction ability.
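
A small NumPy sketch of STES-style example selection: stored trajectories are scored by how similar their motion pattern (frame-to-frame displacements) is to the query's, and the top-k become in-context examples. The plain L2 scoring is a simplification; the paper's criterion also accounts for location, and PG-ES further refines the choice using predicted futures.

```python
import numpy as np

def stes_select(query_past, bank_pasts, bank_futures, top_k=4):
    """Return the top-k (past, future) pairs whose motion pattern best
    matches the query's observed trajectory."""
    q_motion = np.diff(query_past, axis=0)                 # (T-1, 2) displacements
    b_motion = np.diff(bank_pasts, axis=1)                 # (N, T-1, 2)
    d = np.linalg.norm(b_motion - q_motion[None], axis=-1).mean(axis=-1)
    idx = np.argsort(d)[:top_k]
    return [(bank_pasts[i], bank_futures[i]) for i in idx]

rng = np.random.default_rng(1)
bank_p = rng.normal(size=(100, 8, 2)).cumsum(axis=1)       # 100 observed pasts
bank_f = rng.normal(size=(100, 12, 2)).cumsum(axis=1)      # matching futures
examples = stes_select(bank_p[0], bank_p, bank_f)
```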

Result: Achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks.

Conclusion: TrajICL provides an effective framework for pedestrian trajectory prediction that eliminates the need for scenario-specific fine-tuning while maintaining high performance across diverse environments.

Abstract: Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation to scenario-specific data at inference time without fine-tuning or weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.

[211] RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection

Yongxian Liu, Boyang Li, Ting Liu, Zaiping Lin, Wei An

Main category: cs.CV

TL;DR: RRCA-Net is a recurrent reusable-convolution attention network for efficient infrared small target detection that maintains high-level target information through repetitive iteration and enhances contextual correlation between layers.

DetailsMotivation: Infrared small target detection is challenging due to targets being small, dim, shapeless and changeable. Existing CNN-based methods achieve good performance but with heavy feature extraction modules, making efficient detection difficult.

Method: Proposes RRCA-Net with reusable-convolution blocks (RuCB) in recurrent manner without extra parameters, dual interactive attention aggregation module (DIAAM) for mutual enhancement and fusion, and target characteristic inspired loss function (DpT-k loss) with physical and mathematical constraints.
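
The reusable-convolution idea can be sketched as one conv block whose weights are re-applied for several iterations, refining features without adding parameters. A minimal PyTorch sketch follows; the layer sizes and residual form are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReusableConvBlock(nn.Module):
    """One conv block applied recurrently: each pass reuses the same weights,
    so iterative refinement costs no extra parameters."""
    def __init__(self, channels=32, iterations=3):
        super().__init__()
        self.iterations = iterations
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        for _ in range(self.iterations):      # recurrent reuse of the same weights
            x = x + self.block(x)             # residual refinement of target features
        return x

feat = torch.randn(1, 32, 64, 64)
print(ReusableConvBlock()(feat).shape)        # torch.Size([1, 32, 64, 64])
```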

Result: Experimental results on three benchmark datasets (NUAA-SIRST, IRSTD-1k, DenseSIRST) show comparable performance to state-of-the-art methods while maintaining small parameter count, and can be used as plug-and-play module to improve other IRSTD methods.

Conclusion: RRCA-Net achieves efficient and effective infrared small target detection through recurrent reusable-convolution architecture, attention-based feature fusion, and specialized loss function, demonstrating both high performance and parameter efficiency.

Abstract: Infrared small target detection is a challenging task due to its unique characteristics (e.g., small, dim, shapeless and changeable). Recently published CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. To achieve efficient and effective detection, we propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection. Specifically, RRCA-Net incorporates a reusable-convolution block (RuCB) in a recurrent manner without introducing extra parameters. With the help of the repetitive iteration in RuCB, the high-level information of small targets in the deep layers can be well maintained and further refined. Then, a dual interactive attention aggregation module (DIAAM) is proposed to promote the mutual enhancement and fusion of refined information. In this way, RRCA-Net can both achieve high-level feature refinement and enhance the correlation of contextual information between adjacent layers. Moreover, to achieve steady convergence, we design a target characteristic inspired loss function (DpT-k loss) by integrating physical and mathematical constraints. Experimental results on three benchmark datasets (i.e., NUAA-SIRST, IRSTD-1k, DenseSIRST) demonstrate that our RRCA-Net can achieve comparable performance to the state-of-the-art methods while maintaining a small number of parameters, and can act as a plug-and-play module that brings consistent performance improvements to several popular IRSTD methods.

[212] MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory

Ana Carolina Condez, Diogo Tavares, João Magalhães

Main category: cs.CV

TL;DR: MoralCLIP extends multimodal learning with moral grounding using Moral Foundations Theory, creating morally-aware embeddings for vision-language models.

DetailsMotivation: Current vision-language models lack moral reasoning capabilities, which is crucial for human-like cognition and morally-aware AI systems.

Method: Uses Moral Foundations Theory to integrate visual/textual moral cues into unified embeddings, with moral data augmentation scaling dataset to 15,000 image-text pairs.

Result: Explicit moral supervision improves both unimodal and multimodal understanding of moral content, enabling moral recognition and alignment.

Conclusion: Establishes foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.

Abstract: Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.

[213] GenIR: Generative Visual Feedback for Mental Image Retrieval

Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis

Main category: cs.CV

TL;DR: GenIR is a generative multi-round retrieval system that uses diffusion-based image generation to provide explicit visual feedback for mental image retrieval, enabling users to refine queries through interactive visual representations.

DetailsMotivation: Current vision-language models perform well on text-to-image retrieval but fail to support realistic human search behavior, which involves multi-round interactions with vague mental images that evolve through refinement.

Method: Proposes GenIR, a generative retrieval paradigm using diffusion models to create synthetic visual representations that reify the AI’s understanding at each interaction round, providing clear visual feedback for query refinement.

Result: GenIR significantly outperforms existing interactive methods in mental image retrieval scenarios and introduces a high-quality multi-round MIR dataset with automated generation pipeline.

Conclusion: This work establishes the Mental Image Retrieval task with a new dataset and effective generative method, providing foundation for future research in interactive visual search systems.

Abstract: Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system’s understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.

[214] SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score

Mohammad Jalali, Haoyu Lei, Amin Gohari, Farzan Farnia

Main category: cs.CV

TL;DR: SPARKE is a method that enhances prompt-aware diversity in diffusion models by using conditional entropy for diversity guidance, reducing computational complexity from O(n³) to O(n) for scalable generation.

DetailsMotivation: Ensuring adequate diversity in prompt-guided diffusion models is challenging, especially when prompts span a broad semantic spectrum and diversity needs to be evaluated in a prompt-aware manner across similar prompts.

Method: Proposes Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) using conditional entropy for diversity guidance that dynamically conditions diversity measurement on similar prompts, with a special case of Conditional latent RKE Score Guidance to reduce computational complexity.
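
For intuition, here is a sketch of an order-2 matrix-based Rényi kernel entropy, one common definition behind RKE-style diversity scores: with a Gaussian kernel matrix K (unit diagonal), H_2 = -log ||K/n||_F^2, which is larger for more spread-out samples. This is the naive O(n^2) form, not the paper's O(n) conditional latent variant, and the kernel choice is an assumption.

```python
import torch

def rke_order2(x, sigma=1.0):
    """Order-2 matrix-based Rényi kernel entropy of a sample batch x (n, d):
    higher values indicate more diverse samples."""
    d2 = torch.cdist(x, x).pow(2)
    K = torch.exp(-d2 / (2 * sigma**2))       # Gaussian kernel, unit diagonal
    n = x.size(0)
    return -torch.log((K / n).pow(2).sum())   # -log ||K/n||_F^2

diverse = torch.randn(128, 16)
collapsed = torch.randn(1, 16).repeat(128, 1) + 0.01 * torch.randn(128, 16)
print(rke_order2(diverse) > rke_order2(collapsed))   # tensor(True)
```

Guidance then follows the gradient of such a score (conditioned on similar prompts) during sampling to push new generations away from near-duplicates.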

Result: SPARKE improves prompt-aware diversity of generated data in text-to-image diffusion models without significant computational costs, enabling diversity-guided sampling over potentially thousands of generation rounds.

Conclusion: SPARKE effectively addresses the diversity challenge in prompt-guided diffusion models through computationally efficient conditional entropy-based guidance that maintains prompt-aware diversity control.

Abstract: Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: https://mjalali.github.io/SPARKE

[215] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features

Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu, Yuanzhang Li

Main category: cs.CV

TL;DR: dSVA is a generative adversarial attack that exploits both global structural features from contrastive learning and local textural features from masked image modeling in self-supervised Vision Transformers to improve black-box transferability.

DetailsMotivation: Previous adversarial attacks rely on supervised learning features, but this paper explores whether self-supervised Vision Transformer representations can improve adversarial transferability, inspired by the synergy between self-supervised learning and Transformer architecture.

Method: Proposes dSVA - a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning and local textural features from masked image modeling. Uses a novel generative training framework with a generator to create adversarial examples and strategies to train the generator using joint features and attention mechanism of self-supervised ViTs.

Result: The method achieves remarkable black-box transferability to models of various architectures that outperforms state-of-the-art methods. CL and MIM enable ViTs to attend to distinct feature tendencies which, when exploited together, provide great adversarial generalizability.

Conclusion: Exploiting dual deep features distilled by self-supervised ViTs through disrupting both global structural and local textural features leads to superior adversarial transferability across different model architectures.

Abstract: The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, offer great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures, outperforming the state of the art. Code available at https://github.com/spencerwooo/dSVA.

[216] DDL: A Large-Scale Datasets for Deepfake Detection and Localization in Diversified Real-World Scenarios

Changtao Miao, Yi Zhang, Weize Gao, Zhiya Tan, Weiwei Feng, Man Luo, Jianshu Li, Ajian Liu, Yunfeng Diao, Qi Chu, Tao Gong, Zhe Li, Weibin Yao, Joey Tianyi Zhou

Main category: cs.CV

TL;DR: The paper introduces a large-scale deepfake detection and localization (DDL) dataset with 1.4M+ forged samples covering 80 deepfake methods, designed to address limitations in existing datasets and support more reliable detection methods with better interpretability.

DetailsMotivation: Current deepfake detection methods lack interpretability and practical effectiveness due to limitations in existing datasets, which have binary labels, limited forgery scenarios, insufficient diversity in deepfake types, and small data scales.

Method: Constructed a novel DDL dataset with four key innovations: comprehensive deepfake methods (80 methods across 7 architectures), varied manipulation modes (7 classic + 3 novel modes), diverse forgery scenarios and modalities (3 scenarios, 3 modalities), and fine-grained annotations (1.18M+ spatial masks, 0.23M+ temporal segments).

Result: Created a large-scale dataset containing over 1.4M forged samples with comprehensive coverage of deepfake techniques and detailed annotations, providing a more challenging benchmark for real-world forgery detection.

Conclusion: The DDL dataset addresses critical limitations in existing deepfake detection resources and provides essential support for developing next-generation detection, localization, and interpretability methods that can handle complex real-world scenarios.

Abstract: Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. Recent studies have attempted to enhance the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, due to the limitations of forgery datasets, the practical effectiveness of these methods remains suboptimal. The primary reason lies in the fact that most existing deepfake datasets contain only binary labels, with limited variety in forgery scenarios, insufficient diversity in deepfake types, and relatively small data scales, making them inadequate for complex real-world scenarios. To address this predicament, we construct a novel large-scale deepfake detection and localization (DDL) dataset containing over 1.4M+ forged samples and encompassing up to 80 distinct deepfake methods. The DDL design incorporates four key innovations: (1) Comprehensive Deepfake Methods (covering 7 different generation architectures and a total of 80 methods), (2) Varied Manipulation Modes (incorporating 7 classic and 3 novel forgery modes), (3) Diverse Forgery Scenarios and Modalities (including 3 scenarios and 3 modalities), and (4) Fine-grained Forgery Annotations (providing 1.18M+ precise spatial masks and 0.23M+ precise temporal segments). Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods.

[217] ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models

Chihan Huang, Hao Tang

Main category: cs.CV

TL;DR: ScoreAdv is a novel diffusion-based method for generating natural, unrestricted adversarial examples that achieves state-of-the-art attack success rates and image quality while maintaining efficiency and robustness.

DetailsMotivation: Existing adversarial attack methods rely on ℓp-norm constraints that don't align with human perception, and current diffusion-based approaches don't fully leverage denoising capabilities while GAN-based methods suffer from poor image quality due to instability and mode collapse.

Method: ScoreAdv uses an interpretable adversarial guidance mechanism to shift sampling distribution towards adversarial distribution, combined with saliency maps to inject visual information from reference images. It dynamically balances denoising and adversarial perturbation.
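
A minimal PyTorch sketch of one adversarially guided sampling step: after a standard denoising update, the sample is nudged along the gradient that raises the target class's probability, gradually shifting sampling toward the adversarial distribution. The saliency-map term from the paper is omitted, and `denoise_step` plus the toy classifier are hypothetical stand-ins.

```python
import torch

def adversarially_guided_step(x_t, denoise_step, classifier, target_class, scale=1.0):
    """One reverse-diffusion step followed by a gradient nudge toward the
    adversarial target class."""
    x = denoise_step(x_t)                       # one standard denoising update
    x = x.detach().requires_grad_(True)
    logp = torch.log_softmax(classifier(x), dim=-1)[:, target_class].sum()
    grad = torch.autograd.grad(logp, x)[0]
    return (x + scale * grad).detach()          # guidance toward the adversarial target

# Toy usage with stand-in components (all hypothetical):
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.randn(2, 3, 32, 32)
x = adversarially_guided_step(x, lambda z: 0.9 * z, classifier, target_class=3)
```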

Result: Extensive experiments on ImageNet and CelebA show ScoreAdv achieves SOTA attack success rates and image quality across 10 target models in both black-box and white-box settings, with maintained inference efficiency.

Conclusion: ScoreAdv effectively generates unlimited natural adversarial examples that can attack both classification and retrieval models, remaining robust under defensive measures through its dynamic balance mechanism.

Abstract: Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on ℓp-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality, while maintaining inference efficiency. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.

[218] From One to More: Contextual Part Latents for 3D Generation

Shaocong Dong, Lihe Ding, Xiao Chen, Yaokun Li, Yuxin Wang, Yucheng Wang, Qi Wang, Jaehyeok Kim, Chenjian Gao, Zhanpeng Huang, Zibin Wang, Tianfan Xue, Dan Xu

Main category: cs.CV

TL;DR: CoPart is a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation, addressing limitations of single-latent representations in current 3D generation methods.

DetailsMotivation: Current 3D generation methods have three key limitations: single-latent representations fail to capture complex multi-part geometries, holistic latent coding neglects part independence and interrelationships, and global conditioning lacks fine-grained controllability.

Method: The framework decomposes 3D objects into contextual part latents, uses a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, and is trained on Partverse - a novel 3D part dataset created from Objaverse with automated mesh segmentation and human-verified annotations.
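
A loose sketch of the core idea: joint denoising over per-part latents, where each part attends to the others so the composition stays coherent. The shapes, the attention-based stand-in for "mutual guidance", and the update rule are simplifying assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

parts, dim = 4, 32
cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
step = nn.Linear(dim, dim)

z = torch.randn(1, parts, dim)              # one contextual latent per part
with torch.no_grad():
    for t in range(10):                     # toy joint denoising loop
        ctx, _ = cross(z, z, z)             # each part attends to all parts
        z = z - 0.1 * step(z + ctx)         # simplified denoising update
print(z.shape)  # torch.Size([1, 4, 32])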

Result: Extensive experiments demonstrate CoPart’s superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.

Conclusion: CoPart successfully addresses the limitations of current 3D generation methods by enabling part-aware generation with explicit relationship modeling and fine-grained controllability, inspired by human 3D design workflows.

Abstract: Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) Single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) Holistic latent coding neglects part independence and interrelationships critical for compositional design; (3) Global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart - a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) Reduces encoding complexity through part decomposition; ii) Enables explicit part relationship modeling; iii) Supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, ensuring both geometric coherence and foundation model priors. To enable large-scale training, we construct Partverse - a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart’s superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.

[219] Predicting Video Slot Attention Queries from Random Slot-Feature Pairs

Rongzhen Zhao, Jian Li, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: RandSF.Q improves unsupervised video object-centric learning by incorporating next frame features and learning transition dynamics through random slot-feature pair sampling.

DetailsMotivation: Existing video OCL methods neglect next frame features (most informative for query prediction) and fail to learn transition dynamics (essential knowledge for query prediction).

Method: Proposes Random Slot-Feature pair for learning Query prediction (RandSF.Q): (1) new transitioner incorporates both slots and features for better query prediction, (2) trains transitioner using randomly sampled slot-feature pairs to learn transition dynamics.
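
A minimal sketch of the training idea, assuming toy shapes and modules: the transitioner predicts queries from a (slots, features) pair sampled at a random temporal offset, which forces it to model transition dynamics rather than copy slots forward:

```python
import torch
import torch.nn as nn

class Transitioner(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.slot_proj = nn.Linear(dim, dim)
        self.feat_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, slots, feats):
        q = self.slot_proj(slots)                    # queries built from slots
        out, _ = self.feat_attn(q, feats, feats)     # read target-frame features
        return out

T, B, S, N, D = 6, 2, 5, 49, 64
slots_seq = torch.randn(T, B, S, D)    # per-frame slots (from the aggregator)
feats_seq = torch.randn(T, B, N, D)    # per-frame features

transitioner = Transitioner(D)
opt = torch.optim.Adam(transitioner.parameters(), lr=1e-4)

# sample a random (source slots, target features) pair with j > i
i = torch.randint(0, T - 1, (1,)).item()
j = torch.randint(i + 1, T, (1,)).item()
pred_q = transitioner(slots_seq[i], feats_seq[j])
loss = ((pred_q - slots_seq[j]) ** 2).mean()         # supervise with target slots
opt.zero_grad(); loss.backward(); opt.step()
```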

Result: Significantly surpasses existing video OCL methods, achieving up to 10 points improvement on object discovery and setting new state-of-the-art. Also benefits downstream tasks like dynamics modeling.

Conclusion: RandSF.Q effectively addresses key limitations in current video OCL methods by incorporating next frame features and learning transition dynamics, leading to superior performance in object discovery and scene representation.

Abstract: Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: An aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transits current slots to queries for the next frame. This is an effective architecture, but all existing implementations both (i1) neglect to incorporate next frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (t2) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting new state-of-the-art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code, model checkpoints and training logs are available on https://github.com/Genera1Z/RandSF.Q.

[220] Smoothing Slot Attention Iterations and Recurrences

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: SmoothSA improves Slot Attention by preheating cold-start queries for better first-frame aggregation and differentiating transforms between first and non-first frames in videos.

DetailsMotivation: Standard Slot Attention has issues with cold-start queries lacking sample-specific cues on first frames, and uses the same transforms for both first and non-first frames despite their different characteristics.

Method: SmoothSA preheats cold-start queries using input features via self-distillation, and uses full iterations for first frames but single iterations for non-first frames in videos.
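
The control flow can be sketched as follows; the simplified slot-attention update and the mean-pooled "preheating" here are illustrative assumptions (the paper uses a self-distilled module, and full SA adds GRU/MLP refinement):

```python
import torch

def sa_iter(slots, feats):
    # one simplified slot-attention update (the real SA adds GRU + MLP refinement)
    logits = slots @ feats.transpose(1, 2) / feats.shape[-1] ** 0.5
    attn = logits.softmax(dim=1)                   # slots compete for features
    attn = attn / attn.sum(dim=-1, keepdim=True)   # normalize per slot
    return attn @ feats

def preheat(feats, num_slots):
    # initialize queries from pooled input features instead of cold-start noise
    b, _, d = feats.shape
    pooled = feats.mean(dim=1, keepdim=True).repeat(1, num_slots, 1)
    return pooled + 0.01 * torch.randn(b, num_slots, d)

def video_ocl(frames_feats, num_slots=5):
    slots = None
    for t, feats in enumerate(frames_feats):
        if t == 0:
            slots = preheat(feats, num_slots)      # smooth the first frame
            for _ in range(3):                     # full iterations on frame 0
                slots = sa_iter(slots, feats)
        else:
            slots = sa_iter(slots, feats)          # single iteration afterwards
    return slots

frames = [torch.randn(2, 49, 64) for _ in range(4)]
print(video_ocl(frames).shape)  # torch.Size([2, 5, 64])
```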

Result: Comprehensive experiments show effectiveness on object discovery, recognition and downstream benchmarks, with intuitive analyses demonstrating how the method smooths SA iterations and recurrences.

Conclusion: SmoothSA successfully addresses the limitations of standard Slot Attention by smoothing both iterations within frames and recurrences across frames, improving object-centric learning performance.

Abstract: Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues and thus hinder precise aggregation on the image or video's first frame; also, non-first frames' queries are already sample-specific and thus require transforms different from the first frame's aggregation. We address these issues for the first time with our SmoothSA: (1) To smooth SA iterations on the image or video's first frame, we preheat the cold-start queries with rich information of input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our source code, model checkpoints and training logs are available on https://github.com/Genera1Z/SmoothSA.

[221] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu

Main category: cs.CV

TL;DR: Omni-Effects is a unified framework that enables spatially controllable composite visual effects generation, overcoming limitations of current single-effect LoRA training methods through LoRA-MoE and Spatial-Aware Prompt innovations.

DetailsMotivation: Current VFX generation methods are limited to single effects per LoRA training, which prevents spatially controllable composite effects where multiple effects need to be generated concurrently at designated locations.

Method: Proposes LoRA-based Mixture of Experts (LoRA-MoE) to integrate diverse effects while mitigating interference, and Spatial-Aware Prompt (SAP) with Independent-Information Flow module for precise spatial control and effect isolation.
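
As a rough illustration of a LoRA-based mixture of experts, the sketch below keeps a frozen base weight, one low-rank adapter per effect, and a learned router that mixes adapter outputs; all names and the routing scheme are assumptions, not the Omni-Effects code:

```python
import torch
import torch.nn as nn

class LoRAMoE(nn.Module):
    def __init__(self, dim, num_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)               # frozen backbone weight
        self.down = nn.Parameter(torch.randn(num_experts, dim, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, dim))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):                             # x: (B, T, D)
        gates = self.router(x).softmax(dim=-1)        # per-token expert mixture
        delta = torch.einsum("btd,edr,erk->btek", x, self.down, self.up)
        return self.base(x) + torch.einsum("btek,bte->btk", delta, gates)

layer = LoRAMoE(dim=64)
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```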

Result: Extensive experiments show Omni-Effects achieves precise spatial control and diverse effect generation, allowing users to specify both category and location of desired effects.

Conclusion: The framework successfully enables unified generation of prompt-guided effects and spatially controllable composite effects, advancing VFX production capabilities.

Abstract: Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.

[222] KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning

Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, Steven Sloan, Kendall N. Niles, Ken Pathak

Main category: cs.CV

TL;DR: KARMA is an efficient semantic segmentation framework for structural defect detection that uses Kolmogorov-Arnold representation with only 0.959M parameters, achieving competitive performance while being 97% smaller than state-of-the-art methods.

DetailsMotivation: Current deep learning methods for semantic segmentation of structural defects require millions of parameters, making them impractical for real-time inspection systems due to computational constraints.

Method: KARMA uses three innovations: (1) TiKAN module with low-rank factorization for KAN-based feature transformation, (2) optimized feature pyramid with separable convolutions for multi-scale analysis, and (3) static-dynamic prototype mechanism for imbalanced class handling.
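
A very loose sketch of a low-rank, KAN-style feature transform in the spirit of TiKAN: each channel passes through a learnable 1D function (approximated here by a small basis of scaled SiLUs, an assumption) followed by rank-limited mixing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyKANBlock(nn.Module):
    def __init__(self, dim, rank=8, basis=4):
        super().__init__()
        self.scales = nn.Parameter(torch.linspace(0.5, 2.0, basis))
        self.coef = nn.Parameter(0.1 * torch.randn(dim, basis))
        self.down = nn.Linear(dim, rank, bias=False)   # low-rank factorization
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):                              # x: (B, D)
        # learnable 1D function per channel: a small basis of scaled SiLUs
        phi = (F.silu(x.unsqueeze(-1) * self.scales) * self.coef).sum(dim=-1)
        return x + self.up(self.down(phi))             # residual, rank-limited mixing

blk = TinyKANBlock(64)
print(blk(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```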

Result: KARMA achieves competitive mean IoU performance with only 0.959M parameters (97% reduction from 31.04M) and operates at 0.264 GFLOPS, enabling real-time deployment while maintaining accuracy.

Conclusion: KARMA enables practical automated infrastructure inspection systems by providing efficient semantic segmentation suitable for real-time deployment without compromising accuracy.

Abstract: Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.

[223] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng

Main category: cs.CV

TL;DR: HM-Talker is a novel framework for high-fidelity talking head video generation that combines implicit and explicit motion representations using Action Units to overcome motion blur and lip jitter issues in current methods.

DetailsMotivation: Current audio-driven talking head methods suffer from motion blur and lip jitter due to reliance on implicit modeling of audio-facial correlations without explicit articulatory priors.

Method: Proposes HM-Talker with: 1) Cross-Modal Disentanglement Module (CMDM) to extract complementary implicit/explicit motion features and predict AUs from audio, 2) Hybrid Motion Modeling Module (HMMM) that dynamically merges randomly paired features for identity-agnostic learning.

Result: Extensive experiments show HM-Talker’s superiority over state-of-the-art methods in visual quality and lip-sync accuracy, enabling robust lip synchronization across diverse identities.

Conclusion: HM-Talker advances personalized talking head synthesis by effectively combining implicit and explicit motion representations with identity-agnostic learning.

Abstract: Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations–an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker’s superiority over state-of-the-art methods in visual quality and lip-sync accuracy.

[224] D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning – A Benchmark Dataset and Method

Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar

Main category: cs.CV

TL;DR: A novel framework for detecting dark humor in memes using multimodal reasoning with VLMs and cross-reasoning networks, achieving state-of-the-art performance on detection, target identification, and intensity prediction tasks.

DetailsMotivation: Dark humor in online memes presents unique challenges due to implicit, sensitive, and culturally contextual cues, with existing methods lacking resources and approaches for multimodal dark humor detection.

Method: Proposes a reasoning-augmented framework using VLMs for structured explanations via Role-Reversal Self-Loop, extracting features from OCR, reasoning, and visual content, then fusing them through Tri-stream Cross-Reasoning Network (TCRNet).

Result: Outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction, using a novel dataset of 4,379 annotated Reddit memes.

Conclusion: The approach effectively addresses multimodal dark humor understanding through structured reasoning and cross-modal fusion, with released dataset and code to advance research in humor understanding and content moderation.

Abstract: Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author’s perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning

[225] Locality in Image Diffusion Models Emerges from Data Statistics

Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann

Main category: cs.CV

TL;DR: The paper shows that locality in diffusion models emerges from statistical properties of image datasets rather than neural network inductive biases, and that pixel correlations drive this local behavior.

DetailsMotivation: To understand why diffusion models learn local behavior and what factors govern locality patterns, challenging the previous belief that convolutional neural network inductive bias is the main cause.

Method: Demonstrated that optimal parametric linear denoisers exhibit similar locality to deep neural denoisers, analyzed locality theoretically and experimentally through pixel correlations, and crafted an analytical denoiser based on data covariance.
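
The paper's central object, the optimal linear denoiser, has a closed form worth spelling out: for y = x + sigma * eps, the MMSE linear estimate is x_hat = mu + Sigma (Sigma + sigma^2 I)^{-1} (y - mu), so the rows of its weight matrix directly expose the locality induced by pixel correlations. A self-contained toy demonstration:

```python
import torch
import torch.nn.functional as F

def linear_denoiser_weights(data, sigma):
    mu = data.mean(dim=0)
    xc = data - mu
    cov = xc.T @ xc / (len(data) - 1)              # empirical pixel covariance
    eye = torch.eye(cov.shape[0])
    # W = Sigma (Sigma + sigma^2 I)^{-1}
    return cov @ torch.linalg.solve(cov + sigma**2 * eye, eye)

# toy "images": 1D signals with short-range pixel correlations
base = torch.randn(4096, 32)
data = F.avg_pool1d(base.unsqueeze(1), kernel_size=5, stride=1, padding=2).squeeze(1)

W = linear_denoiser_weights(data, sigma=0.5)
print(W[16].abs().argmax())   # row weights peak at/near pixel 16: local behavior
```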

Result: Found that locality arises directly from pixel correlations in image datasets, with different patterns on specialized datasets approximating principal components of data covariance. The analytical denoiser better matches deep diffusion model scores than prior alternatives.

Conclusion: Neural network architectures influence generation quality but their primary role is to capture locality patterns inherent in the data, not create them through inductive bias.

Abstract: Recent work has shown that the generalization ability of image diffusion models arises from the locality properties of the trained neural network. In particular, when denoising a particular pixel, the model relies on a limited neighborhood of the input image around that pixel, which, according to the previous work, is tightly related to the ability of these models to produce novel images. Since locality is central to generalization, it is crucial to understand why diffusion models learn local behavior in the first place, as well as the factors that govern the properties of locality patterns. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset and is not due to the inductive bias of convolutional neural networks, as suggested in previous work. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to deep neural denoisers. We show, both theoretically and experimentally, that this locality arises directly from pixel correlations present in the image datasets. Moreover, locality patterns are drastically different on specialized datasets, approximating principal components of the data’s covariance. We use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than prior expert-crafted alternatives. Our key takeaway is that while neural network architectures influence generation quality, their primary role is to capture locality patterns inherent in the data.

[226] Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction

Guoquan Wei, Liu Shi, Zekun Zhou, Wenzhe Shan, Qiegen Liu

Main category: cs.CV

TL;DR: TurnDiff is a self-supervised diffusion model for low-dose CT denoising that uses contextual sub-data and knowledge distillation to achieve superior reconstruction and generalization without requiring paired clean data.

DetailsMotivation: Current deep learning models for low-dose CT denoising rely on paired data and generalize poorly, while diffusion models must learn the distribution of clean data, which is difficult to satisfy in clinical applications. Self-supervised methods also face generalization challenges when a model pre-trained at the current dose is extended to other doses.

Method: Proposes TurnDiff with contextual subdata self-enhancing similarity strategy in projection domain, combines knowledge distillation with latent diffusion models for optimization, and uses pixel-level self-correcting fusion for fine-grained reconstruction. Dual-domain strategy cascade enables training and testing with only LDCT projection data.

Result: Comprehensive evaluation on benchmark datasets and real-world data demonstrates that TurnDiff consistently outperforms state-of-the-art methods in both reconstruction and generalization performance.

Conclusion: TurnDiff provides an effective self-supervised solution for low-dose CT denoising that achieves superior reconstruction quality and excellent generalization to different dose levels using only LDCT projection data.

Abstract: Current deep learning models for low-dose CT denoising rely heavily on paired data and generalize poorly. Even the widely studied diffusion models need to learn the distribution of clean data for reconstruction, which is difficult to satisfy in clinical applications. At the same time, self-supervised methods face significant degradation in generalizability when models pre-trained for the current dose are extended to other doses. To address these issues, this work proposes TUnable-geneRalizatioN Diffusion (TurnDiff), powered by self-supervised contextual sub-data, for low-dose CT reconstruction. Firstly, a contextual sub-data self-enhancing similarity strategy is designed for denoising centered on the LDCT projection domain, providing an initial prior for the subsequent steps. Subsequently, the initial prior is combined with knowledge distillation and latent diffusion models to optimize image details. The pre-trained model is used for inference reconstruction, and a pixel-level self-correcting fusion technique is proposed for fine-grained reconstruction in the image domain to enhance image fidelity, using the initial prior and the LDCT image as a guide. In addition, the technique generalizes flexibly to higher and lower doses, and even to unseen doses. Cascading dual-domain strategies for self-supervised LDCT denoising, TurnDiff requires only LDCT projection-domain data for training and testing. Comprehensive evaluation on both benchmark datasets and real-world data demonstrates that TurnDiff consistently outperforms state-of-the-art methods in both reconstruction and generalization.

[227] Cycle Diffusion Model for Counterfactual Image Generation

Fangrui Huang, Alan Wang, Binxu Li, Bailey Trang, Ridvan Yesiloglu, Tianyu Hua, Wei Peng, Ehsan Adeli

Main category: cs.CV

TL;DR: Cycle Diffusion Model (CDM) uses cycle training to improve conditioning faithfulness and image quality in medical image generation with diffusion models.

DetailsMotivation: Ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation in medical imaging remains challenging with current deep generative models.

Method: A cycle training framework that fine-tunes diffusion models by incorporating cycle constraints to enforce consistency between generated and original images.
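
The cycle constraint reduces to a simple recipe: generate a counterfactual under a new condition, map it back under the original condition, and penalize deviation from the input. A minimal sketch with an assumed generator interface:

```python
import torch

def cycle_loss(generate, x, c, c_prime):
    x_cf = generate(x, c_prime)        # counterfactual under new condition
    x_back = generate(x_cf, c)         # map back to the original condition
    return ((x_back - x) ** 2).mean()  # cycle-consistency penalty

# toy conditional "generator" standing in for the fine-tuned diffusion model
gen = lambda x, c: x + 0.1 * c.view(-1, 1, 1, 1)
x = torch.randn(2, 1, 16, 16)
print(cycle_loss(gen, x, torch.zeros(2), torch.ones(2)))
```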

Result: Experiments on 3D brain MRI datasets show improved conditioning accuracy and enhanced image quality as measured by FID and SSIM metrics.

Conclusion: The cycle strategy in CDM is effective for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual analysis, and disease progression modeling.

Abstract: Deep generative models have demonstrated remarkable success in medical image synthesis. However, ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains a challenge. In this work, we introduce a cycle training framework to fine-tune diffusion models for improved conditioning adherence and enhanced synthetic image realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency between generated and original images by incorporating cycle constraints, enabling more reliable direct and counterfactual generation. Experiments on a combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and PPMI) show that our method improves conditioning accuracy and enhances image quality as measured by FID and SSIM. The results suggest that the cycle strategy used in CDM can be an effective method for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual analysis, and disease progression modeling.

[228] LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution

Xiaohui Li, Shaobin Zhuang, Shuo Cao, Yang Yang, Yuandong Pu, Qi Qin, Siqi Luo, Bin Fu, Yihao Liu

Main category: cs.CV

TL;DR: LinearSR introduces a holistic framework that overcomes key challenges in using Linear Attention for photorealistic image super-resolution, achieving state-of-the-art perceptual quality with exceptional efficiency.

DetailsMotivation: Generative models for Image Super-Resolution rely on self-attention with quadratic complexity, creating computational bottlenecks. Linear Attention offers linear complexity but has been hindered by unsolved challenges in photorealistic SR applications.

Method: The framework uses: 1) ESGF strategy to resolve training instability, 2) SNR-based Mixture of Experts to mitigate perception-distortion trade-off, and 3) TAG guidance paradigm based on ‘precision-over-volume’ principle.

Result: LinearSR achieves state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass achieves SOTA-level speed while maintaining competitive multi-step inference time.

Conclusion: This work provides the first robust methodology for applying Linear Attention in photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.

Abstract: Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention’s quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental training instability that causes catastrophic model divergence using our novel “knee point”-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our “precision-over-volume” principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.

[229] Real-Time Neural Video Compression with Unified Intra and Inter Coding

Hui Xiang, Yifan Bian, Li Li, Jingran Wu, Xianguo Zhang, Dong Liu

Main category: cs.CV

TL;DR: A neural video compression framework with unified intra/inter coding that adaptively handles disocclusion and new content while preventing error propagation, achieving 12.1% BD-rate reduction over DCVC-RT with real-time performance.

DetailsMotivation: Existing neural video compression schemes have limitations in handling disocclusion, new content, and interframe error propagation. The authors aim to eliminate these issues by borrowing intra coding concepts from classic video coding.

Method: Proposed an NVC framework with unified intra and inter coding using a single adaptive model, plus simultaneous two-frame compression to exploit both forward and backward interframe redundancy.

Result: Outperforms DCVC-RT by 12.1% average BD-rate reduction, delivers more stable bitrate and quality per frame, and maintains real-time encoding/decoding performance.

Conclusion: The proposed framework successfully addresses key limitations of existing NVC schemes while maintaining real-time performance, with code and models to be released.

Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performance. Code and models will be released.

[230] MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos

Gabriel Fiastre, Antoine Yang, Cordelia Schmid

Main category: cs.CV

TL;DR: Proposes MaskCaptioner, an end-to-end model for dense video object captioning that uses synthetic captions from a VLM to train on extended datasets LVISCap and LV-VISCap, achieving SOTA results on multiple benchmarks.

DetailsMotivation: Previous approaches use disjoint training strategies due to high manual annotation costs, leading to suboptimal performance in dense video object captioning tasks.

Method: Generate synthetic captions about spatio-temporally localized entities using a state-of-the-art VLM, extend LVIS and LV-VIS datasets with these captions, and train MaskCaptioner for joint detection, segmentation, tracking and captioning.

Result: MaskCaptioner achieves state-of-the-art results on three benchmarks: VidSTG, VLN and BenSMOT after pretraining on LVISCap and LV-VISCap datasets.

Conclusion: The proposed approach effectively addresses the annotation cost issue in DVOC through synthetic caption generation and enables end-to-end training for improved performance.

Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.

[231] SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation

Yeh Keng Hao, Hsu Tzu Wei, Sun Min

Main category: cs.CV

TL;DR: A lightweight framework for AR/VR edge devices using encoder-decoder architecture with sparse convolution, SPLite decoder, and quantization-aware training achieves 2.98x speed-up on Raspberry Pi 5 while maintaining accuracy comparable to state-of-the-art methods.

DetailsMotivation: Address the challenge of deploying deep learning models on AR/VR edge devices requiring real-time inference, low power consumption, and minimal latency while balancing efficiency and performance.

Method: Encoder-decoder architecture with sparse convolution on ResNet-18 backbone to exploit sparsity in hand pose images, SPLite decoder for faster processing, and quantization-aware training for memory optimization.
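
For the quantization-aware training component, a common recipe (assumed here; the paper may differ in details) is to fake-quantize weights in the forward pass and let gradients flow through a straight-through estimator:

```python
import torch
import torch.nn as nn

def fake_quant(w, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (q - w).detach()          # straight-through estimator

class QATLinear(nn.Linear):
    def forward(self, x):
        return nn.functional.linear(x, fake_quant(self.weight), self.bias)

layer = QATLinear(64, 21 * 3)            # e.g. 21 3D hand keypoints
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
pred = layer(torch.randn(8, 64))
loss = (pred ** 2).mean()
loss.backward(); opt.step()              # gradients flow despite rounding
```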

Result: 42% end-to-end efficiency improvement, 3.1x frame rate boost on Raspberry Pi 5, 2.98x overall speed-up, with minimal accuracy impact (PA-MPJPE increased only from 9.0 mm to 9.1 mm on FreiHAND).

Conclusion: The proposed framework successfully achieves significant computational efficiency improvements for edge devices while maintaining comparable accuracy to state-of-the-art approaches, making it suitable for AR/VR applications.

Abstract: With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture significantly boosts the decoding process’s frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.

[232] Fit for Purpose? Deepfake Detection in the Real World

Guangyu Lin, Li Lin, Christina P. Walker, Daniel S. Schiff, Shu Hu

Main category: cs.CV

TL;DR: This paper introduces the first systematic benchmark using real-world political deepfakes from social media to evaluate deepfake detectors, finding that current models struggle with generalization and are vulnerable to simple manipulations.

DetailsMotivation: The proliferation of AI-generated content, especially political deepfakes, poses serious misinformation risks. Most existing detection models are trained on synthetic datasets and lack generalizability to real-world political deepfakes circulating on social media.

Method: Created a systematic benchmark based on the Political Deepfakes Incident Database - a curated collection of real-world political deepfakes from social media since 2018. Evaluated state-of-the-art deepfake detectors from academia, government, and industry.

Result: Academic and government detectors performed poorly. Paid tools achieved higher performance than free models, but all detectors struggled to generalize to authentic political deepfakes and were vulnerable to simple manipulations, especially in videos.

Conclusion: There is an urgent need for politically contextualized deepfake detection frameworks to better protect the public from real-world political deepfake threats.

Abstract: The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain. Results urge the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.

[233] GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang

Main category: cs.CV

TL;DR: GRPO-Guard addresses systematic importance-ratio distribution shifts in GRPO-based reinforcement learning for flow-matching models, preventing implicit over-optimization through ratio normalization and gradient reweighting.

DetailsMotivation: Existing GRPO frameworks suffer from systematic shifts in importance-ratio distribution (mean below 1, inconsistent variance across timesteps), which prevents proper constraint of overconfident positive updates and leads to implicit over-optimization where proxy rewards increase but essential metrics like image quality deteriorate.

Method: GRPO-Guard introduces two key components: (1) ratio normalization to restore balanced and step-consistent importance ratios, ensuring proper PPO clipping across denoising timesteps; (2) gradient reweighting to equalize policy gradients over noise conditions, preventing excessive updates from particular timestep regions.
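
A sketch of what ratio normalization might look like inside a PPO-style loss; the exact normalization used by GRPO-Guard may differ, so treat this as an assumed form:

```python
import torch

def guarded_ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = (logp_new - logp_old).exp()
    ratio = ratio / ratio.mean().detach()   # ratio normalization: mean back to 1
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

# ratios whose mean drifts below 1, as observed for flow-matching GRPO
logp_old = 0.1 * torch.randn(256)
logp_new = logp_old - 0.05 + 0.1 * torch.randn(256)
print(guarded_ppo_loss(logp_new, logp_old, torch.randn(256)))
```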

Result: Extensive experiments on multiple diffusion backbones (SD3.5M, Flux.1-dev) and diverse proxy tasks show that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.

Conclusion: GRPO-Guard provides a simple yet effective enhancement to GRPO frameworks that stabilizes optimization and mitigates implicit over-optimization without relying on heavy KL regularization, making the learned policies more practical for real-world use.

Abstract: Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.

[234] FARMER: Flow AutoRegressive Transformer over Pixels

Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu

Main category: cs.CV

TL;DR: FARMER is a generative framework that combines Normalizing Flows and Autoregressive models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels.

DetailsMotivation: Direct autoregressive modeling of visual pixel data suffers from extremely long sequences and high-dimensional spaces, making it inefficient for image generation.

Method: Uses invertible autoregressive flow to transform images into latent sequences, employs self-supervised dimension reduction to partition latent channels, and implements one-step distillation with classifier-free guidance for efficient inference.

Result: Achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.

Conclusion: FARMER successfully unifies NF and AR models to overcome limitations of pixel-level autoregressive modeling, enabling efficient and high-quality image generation with tractable likelihoods.

Abstract: Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and it underpins the scaling successes of Large Language Models via autoregressive modeling. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.

[235] VC4VG: Optimizing Video Captions for Text-to-Video Generation

Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin

Main category: cs.CV

TL;DR: VC4VG is a video caption optimization framework for text-to-video generation that analyzes caption requirements, proposes design methodology, and validates effectiveness through improved video generation performance.

DetailsMotivation: Current text-to-video generation lacks optimized video caption strategies, and high-quality video-text pairs are critical for training coherent and instruction-aligned video generation models.

Method: Proposed VC4VG framework that analyzes caption content from T2V perspective, decomposes essential elements for video reconstruction, and creates VC4VG-Bench benchmark with fine-grained, multi-dimensional metrics.

Result: Extensive T2V fine-tuning experiments show strong correlation between improved caption quality and video generation performance, validating the framework’s effectiveness.

Conclusion: The VC4VG framework successfully addresses the gap in video caption optimization for T2V generation and provides tools for further research advancement.

Abstract: Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.

[236] Reasoning Visual Language Model for Chest X-Ray Analysis

Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, Pengfei Guo, Yucheng Tang, Daguang Xu

Main category: cs.CV

TL;DR: A framework that brings chain-of-thought reasoning to chest X-ray interpretation, making AI predictions more transparent and clinically auditable by aligning intermediate reasoning steps with observable image evidence and radiology workflow.

DetailsMotivation: Current vision-language models for medical image analysis are opaque and lack the transparent, stepwise reasoning that clinicians rely on for trust and verification.

Method: Couples high-fidelity visual encoding with two-stage training: reasoning-style supervised fine-tuning followed by reinforcement learning using verifiable rewards over X-ray abnormalities. The model outputs reasoning that mirrors radiologists’ systematic thought process, uncertainty, and differential diagnosis.
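
A hypothetical sketch of a verifiable reward over an abnormality list, checking the findings named in a generated report against ground-truth labels; the paper's actual reward design may differ:

```python
def abnormality_reward(generated_text, true_findings, all_findings):
    # findings the report mentions, out of a fixed abnormality vocabulary
    pred = {f for f in all_findings if f in generated_text.lower()}
    truth = set(true_findings)
    tp = len(pred & truth)
    fp = len(pred - truth)
    fn = len(truth - pred)
    return tp - 0.5 * (fp + fn)   # toy verifiable score penalizing misses/extras

print(abnormality_reward(
    "mild cardiomegaly with small pleural effusion",
    ["cardiomegaly", "pleural effusion"],
    ["cardiomegaly", "pleural effusion", "pneumothorax", "edema"],
))  # 2.0
```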

Result: Achieves competitive multi-label classification in out-of-distribution evaluation while improving interpretability. In reader studies with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports.

Conclusion: The approach enables quality assurance, error analysis, and safer human-AI collaboration in medical imaging, where reasoning quality is as critical as prediction quality. The model NV-Reason-CXR-3B is released to support progress toward trustworthy, explainable AI.

Abstract: Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists' systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.

[237] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild

Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang, Yiheng Wang, Yun Wang, Long Zhang, Xiangyu Chen, Changzhi Sun, Jixiang Luo, Dell Zhang, Hao Sun, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: TeleEgo is a comprehensive benchmark for evaluating egocentric AI assistants with long-duration, streaming, multi-modal data across real-world daily activities, focusing on memory, understanding, and cross-memory reasoning capabilities.

DetailsMotivation: Existing benchmarks evaluate AI assistant capabilities in isolation, lack realistic streaming scenarios, or only support short-term tasks, failing to capture the real-world requirements of egocentric AI assistants that need to process multi-modal inputs in real-time with long-term memory retention.

Method: Created TeleEgo dataset with over 14 hours per participant of synchronized egocentric video, audio, and text across four domains (work & study, lifestyle & routines, social activities, outings & culture), aligned on a unified global timeline with human-refined visual narrations and speech transcripts. Includes 12 diagnostic subtasks across three core capabilities with 3,291 human-verified QA items evaluated in streaming setting.

Result: The benchmark provides realistic evaluation of AI assistants with two key metrics: Real-Time Accuracy (assessing correctness and temporal responsiveness) and Memory Persistence Time (assessing long-term retention).

Conclusion: TeleEgo offers a realistic and comprehensive evaluation framework to advance the development of practical AI assistants capable of handling real-world streaming scenarios with long-term memory capabilities.

Abstract: Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics – Real-Time Accuracy and Memory Persistence Time – to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.

[238] MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding

Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang

Main category: cs.CV

TL;DR: MMEdge is a real-time multimodal inference framework for edge devices that uses pipelined sensing and encoding to reduce latency while maintaining accuracy through temporal aggregation and adaptive optimization.

Motivation: Enable real-time multimodal inference on resource-constrained edge devices by addressing the tight coupling between sensing dynamics and model execution, and complex inter-modality dependencies that prior work overlooked.

Method: Decomposes inference into fine-grained sensing/encoding units for incremental computation, uses temporal aggregation to capture dynamics, adaptive configuration optimizer for resource management, and cross-modal speculative skipping for early termination.
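
The pipelined design with speculative early decisions can be sketched in a few lines; the encoder, classifier, mean-pooling aggregation, and confidence threshold here are stand-ins, not MMEdge's actual components.

```python
# Sketch of pipelined sensing/encoding with confidence-based early decision.
# Assumes the stream yields at least one sensing unit.
import torch

def pipelined_infer(sensor_stream, encoder, classifier, conf_thresh=0.9):
    """Encode fine-grained sensing units as they arrive; stop early when confident."""
    units = []
    for chunk in sensor_stream:                  # each chunk is one sensing unit
        units.append(encoder(chunk))             # incremental encoding
        pooled = torch.stack(units).mean(dim=0)  # simple temporal aggregation
        probs = torch.softmax(classifier(pooled), dim=-1)
        if probs.max().item() >= conf_thresh:    # speculative early decision
            break
    return probs.argmax().item()
```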

Result: Significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics, validated on public datasets and real-world UAV testbed.

Conclusion: MMEdge provides an effective solution for real-time multimodal edge inference through pipelined design and adaptive optimization, achieving both low latency and high accuracy.

Abstract: Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy. Such a pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.

cs.AI

[239] Towards Piece-by-Piece Explanations for Chess Positions with SHAP

Francesco Spinnato

Main category: cs.AI

TL;DR: Adapting SHAP to chess analysis to attribute engine evaluations to specific pieces by treating pieces as features and systematically ablating them.

Motivation: Chess engines provide opaque centipawn evaluations that obscure individual piece contributions, making it hard to understand the reasoning behind positions.

Method: Treat chess pieces as features and systematically ablate them to compute SHAP values, attributing engine evaluations to specific pieces in an additive, interpretable manner.
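
The ablation idea maps directly onto the standard Monte Carlo estimator of Shapley values. A minimal sketch, where `evaluate` stands in for an engine's centipawn score of the position restricted to a subset of pieces (both the function and the piece encoding are hypothetical):

```python
# Monte Carlo Shapley attribution over pieces. `evaluate(subset)` is a
# hypothetical engine call scoring the position with only `subset` present.
import random

def shapley_piece_values(pieces, evaluate, n_samples=200):
    values = {p: 0.0 for p in pieces}
    for _ in range(n_samples):
        order = random.sample(pieces, len(pieces))  # random permutation
        present = set()
        prev = evaluate(present)                    # baseline without pieces
        for p in order:
            present.add(p)
            cur = evaluate(present)
            values[p] += (cur - prev) / n_samples   # marginal contribution of p
            prev = cur
    return values  # in expectation, sums to evaluate(all) - evaluate(empty)
```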

Result: Developed a method that provides locally faithful and human-interpretable piece-wise explanations of chess engine evaluations.

Conclusion: This approach enables better visualization, human training, and engine comparison in chess analysis, with released code and data to advance interpretable chess AI research.

Abstract: Contemporary chess engines offer precise yet opaque evaluations, typically expressed as centipawn scores. While effective for decision-making, these outputs obscure the underlying contributions of individual pieces or patterns. In this paper, we explore adapting SHAP (SHapley Additive exPlanations) to the domain of chess analysis, aiming to attribute a chess engine's evaluation to specific pieces on the board. By treating pieces as features and systematically ablating them, we compute additive, per-piece contributions that explain the engine's output in a locally faithful and human-interpretable manner. This method draws inspiration from classical chess pedagogy, where players assess positions by mentally removing pieces, and grounds it in modern explainable AI techniques. Our approach opens new possibilities for visualization, human training, and engine comparison. We release accompanying code and data to foster future research in interpretable chess AI.

[240] An Agentic Framework for Rapid Deployment of Edge AI Solutions in Industry 5.0

Jorge Martinez-Gil, Mario Pichler, Nefeli Bountouni, Sotiris Koussouris, Marielena Márquez Barreiro, Sergio Gusmeroli

Main category: cs.AI

TL;DR: A novel Industry 5.0 framework for simplified AI model deployment on edge devices with local inference and real-time processing capabilities.

Motivation: To address the need for simplified AI deployment in industrial settings while reducing latency and avoiding external data transfer through local processing.

Method: Agent-based implementation where individual agents (human, algorithmic, or collaborative) handle well-defined tasks, supporting modular integration with low resource requirements.

Result: Preliminary evaluations in food industry scenarios show improved deployment time and system adaptability performance.

Conclusion: The framework successfully enables flexible AI deployment on edge devices for Industry 5.0 applications, with source code publicly available.

Abstract: We present a novel framework for Industry 5.0 that simplifies the deployment of AI models on edge devices in various industrial settings. The design reduces latency and avoids external data transfer by enabling local inference and real-time processing. Our implementation is agent-based, which means that individual agents, whether human, algorithmic, or collaborative, are responsible for well-defined tasks, enabling flexibility and simplifying integration. Moreover, our framework supports modular integration and maintains low resource requirements. Preliminary evaluations in real food-industry scenarios indicate improvements in deployment time and system adaptability. The source code is publicly available at https://github.com/AI-REDGIO-5-0/ci-component.

[241] Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis

Xinhan Zheng, Huyu Wu, Xueting Wang, Haiyun Jiang

Main category: cs.AI

TL;DR: MLLMs have text bias due to visual keys being out-of-distribution in attention space, not just from data imbalance.

Motivation: To understand why multimodal LLMs prefer text over visual inputs, moving beyond external factors like data imbalance to examine internal architectural causes.

Method: Extracted key vectors from LLaVA and Qwen2.5-VL, analyzed distribution using t-SNE visualization and Jensen-Shannon divergence to measure inter-modal vs intra-modal variation.
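
One simple way to put a number on inter-modal key divergence is to cluster the pooled keys and compare each modality's histogram over clusters; this proxy is our illustration, not necessarily the paper's exact procedure.

```python
# Sketch: Jensen-Shannon divergence between visual and textual key
# distributions via a shared k-means quantization of the key space.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

def key_space_divergence(text_keys: np.ndarray, visual_keys: np.ndarray, k=50):
    pooled = np.vstack([text_keys, visual_keys])
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(pooled)
    t = np.bincount(labels[: len(text_keys)], minlength=k) + 1e-9
    v = np.bincount(labels[len(text_keys):], minlength=k) + 1e-9
    t, v = t / t.sum(), v / v.sum()
    return jensenshannon(t, v) ** 2  # squared JS distance = JS divergence
```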

Result: Visual and textual keys occupy distinct subspaces with statistically significant divergence, exceeding intra-modal variation by orders of magnitude.

Conclusion: Text bias originates from intrinsic misalignment in attention key space rather than solely external data factors.

Abstract: Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model’s internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.

[242] Symbolically Scaffolded Play: Designing Role-Sensitive Prompts for Generative NPC Dialogue

Vanessa Figueiredo, David Elumeze

Main category: cs.AI

TL;DR: This paper investigates whether constrained prompts improve player experience in LLM-powered games, finding that scaffolding effects are role-dependent and overturning the assumption that tighter constraints inherently enhance play.

Motivation: To determine whether constrained prompts actually improve player experience in interactive games powered by large language models, as it remains unclear despite the promise of LLMs for enabling unscripted dialogue in NPCs.

Method: Conducted a within-subjects usability study (N=10) comparing high-constraint and low-constraint prompts in a voice-based detective game powered by GPT-4o, followed by redesigning HCP into a hybrid JSON+RAG scaffold and synthetic evaluation with an LLM judge.
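
To make "fuzzy numerical boundaries" concrete, here is an invented miniature of what a role-sensitive scaffold might look like; every field name and value is hypothetical, not taken from the paper.

```python
# Hypothetical role-sensitive scaffolds: JSON-like structure plus soft numeric
# knobs instead of hard rules. All fields are invented for illustration.
interviewer_scaffold = {
    "role": "interviewer",
    "goals": ["advance the case", "stay on protocol"],
    "fuzzy_bounds": {"on_topic": 0.9, "improvisation": 0.2},  # tight: stability
}
suspect_scaffold = {
    "role": "suspect",
    "goals": ["protect your secret", "stay in character"],
    "fuzzy_bounds": {"on_topic": 0.5, "improvisation": 0.8},  # loose: believability
}

def render_system_prompt(scaffold: dict) -> str:
    b = scaffold["fuzzy_bounds"]
    return (
        f"You are the {scaffold['role']}. Goals: {', '.join(scaffold['goals'])}. "
        f"Stay on topic roughly {b['on_topic']:.0%} of the time; "
        f"improvise freely in up to {b['improvisation']:.0%} of your turns."
    )
```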

Result: No reliable experiential differences between constraint levels beyond sensitivity to technical breakdowns. Scaffolding effects were role-dependent: Interviewer NPC gained stability while suspect NPCs lost improvisational believability.

Conclusion: Tighter constraints do not inherently enhance play. Introduced Symbolically Scaffolded Play framework where symbolic structures are expressed as fuzzy numerical boundaries to stabilize coherence while preserving improvisation for engagement.

Abstract: Large Language Models (LLMs) promise to transform interactive games by enabling non-player characters (NPCs) to sustain unscripted dialogue. Yet it remains unclear whether constrained prompts actually improve player experience. We investigate this question through The Interview, a voice-based detective game powered by GPT-4o. A within-subjects usability study ($N=10$) compared high-constraint (HCP) and low-constraint (LCP) prompts, revealing no reliable experiential differences beyond sensitivity to technical breakdowns. Guided by these findings, we redesigned the HCP into a hybrid JSON+RAG scaffold and conducted a synthetic evaluation with an LLM judge, positioned as an early-stage complement to usability testing. Results uncovered a novel pattern in which scaffolding effects were role-dependent: the Interviewer (quest-giver NPC) gained stability, while suspect NPCs lost improvisational believability. These findings overturn the assumption that tighter constraints inherently enhance play. Extending fuzzy-symbolic scaffolding, we introduce Symbolically Scaffolded Play, a framework in which symbolic structures are expressed as fuzzy, numerical boundaries that stabilize coherence where needed while preserving improvisation where surprise sustains engagement.

[243] Through the Judge’s Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei

Main category: cs.AI

TL;DR: A human-LLM collaborative framework that infers thinking traces from label-only annotations to improve LLM rater reliability for subjective evaluation tasks.

Motivation: LLMs are increasingly used as evaluators but lack reliability for subjective tasks where human judgments involve subtle reasoning beyond simple labels. Thinking traces (the reasoning behind judgments) are informative but hard to collect.

Method: Proposes a rejection sampling method to reconstruct thinking traces from label-only annotations at scale. Uses these traces to fine-tune open LLM raters and synthesize clearer annotation guidelines for proprietary LLM raters.
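
The rejection-sampling idea is compact: sample candidate reasoning traces and keep only those whose final judgment reproduces the human label. The prompt format, `generate` interface, and parsing below are placeholders, not the paper's code.

```python
# Sketch of inferring a thinking trace consistent with a label-only annotation.
def infer_thinking_trace(generate, item: str, gold_label: str, max_tries=16):
    """generate(prompt) -> model text expected to end with 'LABEL: <label>'."""
    prompt = (
        f"Item: {item}\n"
        "Think step by step about how a rater would judge this item, "
        "then end with 'LABEL: <your label>'."
    )
    for _ in range(max_tries):
        text = generate(prompt)
        trace, _, label = text.rpartition("LABEL:")
        if label.strip() == gold_label:   # accept: trace explains the gold label
            return trace.strip()
    return None                           # reject the item if no trace matches
```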

Result: Significantly improved LLM-human agreement across multiple datasets. Refined annotation guidelines also increased agreement among different LLM models.

Conclusion: LLMs can serve as practical proxies for unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance LLM rater reliability.

Abstract: Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.

[244] The Information-Theoretic Imperative: Compression and the Epistemic Foundations of Intelligence

Christian Dittrich, Jennifer Flygare Kinne

Main category: cs.AI

TL;DR: The paper introduces a two-level framework (ITI and CEP) explaining why compression leads to causal structure discovery rather than superficial patterns, linking survival pressure to intelligence through information-theoretic and evolutionary constraints.

Motivation: To address why compression processes enforce discovery of causal structure rather than superficial statistical patterns, which existing frameworks leave underspecified.

Method: Introduces a two-level framework: Information-Theoretic Imperative (ITI) and Compression Efficiency Principle (CEP). ITI establishes evolutionary pressure for predictive compression, while CEP specifies how efficient compression selects for generative causal models through exception-accumulation dynamics.

Result: The framework yields testable predictions: compression efficiency correlates with out-of-distribution generalization, exception-accumulation rates differentiate causal from correlational models, hierarchical systems show increasing efficiency across abstraction layers, and biological systems demonstrate metabolic costs tracking representational complexity.

Conclusion: ITI and CEP provide a unified account of intelligence convergence across biological, artificial, and multi-scale systems, showing intelligence as a mechanically necessary outcome of persistence in structured environments without invoking consciousness assumptions.

Abstract: Existing frameworks converge on the centrality of compression to intelligence but leave underspecified why this process enforces the discovery of causal structure rather than superficial statistical patterns. We introduce a two-level framework to address this gap. The Information-Theoretic Imperative (ITI) establishes that any system persisting in uncertain environments must minimize epistemic entropy through predictive compression: this is the evolutionary “why” linking survival pressure to information-processing demands. The Compression Efficiency Principle (CEP) specifies how efficient compression mechanically selects for generative, causal models through exception-accumulation dynamics, making reality alignment a consequence rather than a contingent achievement. Together, ITI and CEP define a causal chain: from survival pressure to prediction necessity, compression requirement, efficiency optimization, generative structure discovery, and ultimately reality alignment. Each link follows from physical, information-theoretic, or evolutionary constraints, implying that intelligence is the mechanically necessary outcome of persistence in structured environments. This framework yields empirically testable predictions: compression efficiency, measured as approach to the rate-distortion frontier, correlates with out-of-distribution generalization; exception-accumulation rates differentiate causal from correlational models; hierarchical systems exhibit increasing efficiency across abstraction layers; and biological systems demonstrate metabolic costs that track representational complexity. ITI and CEP thereby provide a unified account of convergence across biological, artificial, and multi-scale systems, addressing the epistemic and functional dimensions of intelligence without invoking assumptions about consciousness or subjective experience.

[245] Approximating Human Preferences Using a Multi-Judge Learned System

Eitán Sprejer, Fernando Avalos, Augusto Bernardi, Jose Pedro Brito de Azevedo Faustino, Jacob Haimes, Narmeen Fatimah Oozeer

Main category: cs.AI

TL;DR: A framework for modeling diverse persona-based preferences by aggregating multiple rubric-conditioned LLM judges to address calibration challenges, bias, and instability in LLM-based evaluation systems.

Motivation: Aligning LLM-based judges with human preferences is challenging due to calibration difficulties, rubric sensitivity, bias, and instability, which hinders applications like reliable reward models for RLHF and effective model routing systems.

Method: Proposes a persona-based framework that learns to aggregate outputs from multiple rubric-conditioned judges, with two implementations: Generalized Additive Model (GAM) and Multi-Layer Perceptron (MLP).
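
The MLP variant of the aggregator reduces to learning a mapping from per-rubric judge scores to a persona's preference score. A minimal sketch on synthetic data (shapes, weights, and hyperparameters are illustrative):

```python
# Sketch: learn to aggregate rubric-conditioned judge scores into a persona's
# preference. Data is synthetic; real labels would come from persona annotations.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 4))  # 4 rubric-conditioned judge scores
w = np.array([0.5, 0.1, 0.3, 0.1])     # synthetic persona weighting of rubrics
y = X @ w + rng.normal(0, 0.3, 500)    # stand-in preference labels

agg = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
agg.fit(X, y)
print(agg.predict(X[:3]))              # aggregated preference estimates
```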

Result: The approach investigates performance against naive baselines and assesses robustness through case studies on human and LLM-judge biases.

Conclusion: The primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct aggregator implementations to improve alignment of LLM judges with human preferences.

Abstract: Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM-judge biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).

[246] SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications

Emily Herron, Junqi Yin, Feiyi Wang

Main category: cs.AI

TL;DR: SciTrust 2.0 is a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. It includes novel benchmarks and evaluates seven LLMs, finding general-purpose models outperform science-specialized ones.

Motivation: LLMs have transformative potential in scientific research but raise significant trustworthiness concerns in high-stakes contexts, necessitating comprehensive evaluation frameworks.

Method: Developed SciTrust 2.0 framework with novel open-ended truthfulness benchmarks using verified reflection-tuning pipeline and expert validation, plus novel ethics benchmark covering eight subcategories. Evaluated seven LLMs using multiple metrics including accuracy, semantic similarity, and LLM-based scoring.

Result: General-purpose industry models outperformed science-specialized models across all trustworthiness dimensions. GPT-o4-mini showed superior performance in truthfulness and adversarial robustness. Science-specialized models showed significant deficiencies in logical/ethical reasoning and concerning safety vulnerabilities in high-risk domains.

Conclusion: The framework provides foundation for developing more trustworthy AI systems and advancing research on model safety and ethics in scientific contexts. The framework is open-sourced to support further development.

Abstract: Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. Our framework incorporates novel, open-ended truthfulness benchmarks developed through a verified reflection-tuning pipeline and expert validation, alongside a novel ethics benchmark for scientific research contexts covering eight subcategories including dual-use research and bias. We evaluated seven prominent LLMs, including four science-specialized models and three general-purpose industry models, using multiple evaluation metrics including accuracy, semantic similarity measures, and LLM-based scoring. General-purpose industry models overall outperformed science-specialized models across each trustworthiness dimension, with GPT-o4-mini demonstrating superior performance in truthfulness assessments and adversarial robustness. Science-specialized models showed significant deficiencies in logical and ethical reasoning capabilities, along with concerning vulnerabilities in safety evaluations, particularly in high-risk domains such as biosecurity and chemical weapons. By open-sourcing our framework, we provide a foundation for developing more trustworthy AI systems and advancing research on model safety and ethics in scientific contexts.

[247] FinOps Agent – A Use-Case for IT Infrastructure and Cost Optimization

Ngoc Phuoc An Vo, Manish Kesarwani, Ruchi Mahindru, Chandrasekhar Narayanaswami

Main category: cs.AI

TL;DR: The paper proposes using autonomous AI agents to automate FinOps processes, addressing challenges with heterogeneous cloud billing data by building an agent that can retrieve, consolidate, analyze data and generate optimization recommendations comparable to human practitioners.

Motivation: FinOps practitioners face fundamental challenges with heterogeneous billing data formats, taxonomies, and metrics from multiple cloud providers and internal systems, making it difficult to synthesize actionable insights and make time-sensitive decisions.

Method: Built a FinOps agent for IT infrastructure and cost optimization use-case, simulating a realistic end-to-end industry process from data retrieval from various sources to consolidation, analysis, and generating optimization recommendations.

Result: The agent was evaluated using several open-source and closed-source language models with defined metrics, showing it could understand, plan, and execute tasks as well as an actual FinOps practitioner.

Conclusion: Autonomous, goal-driven AI agents can effectively automate FinOps processes and perform comparably to human practitioners in handling complex cloud billing data and generating optimization insights.

Abstract: FinOps (Finance + Operations) represents an operational framework and cultural practice which maximizes cloud business value through collaborative financial accountability across engineering, finance, and business teams. FinOps practitioners face a fundamental challenge: billing data arrives in heterogeneous formats, taxonomies, and metrics from multiple cloud providers and internal systems, which makes it difficult to synthesize actionable insights and make time-sensitive decisions. To address this challenge, we propose leveraging autonomous, goal-driven AI agents for FinOps automation. In this paper, we built a FinOps agent for a typical IT infrastructure and cost optimization use-case. We built a system simulating a realistic end-to-end industry process, from retrieving data from various sources to consolidating and analyzing the data and generating optimization recommendations. We defined a set of metrics and evaluated our agent with several open-source and closed-source language models; the results show that the agent was able to understand, plan, and execute tasks as well as an actual FinOps practitioner.

[248] Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning

Nissan Yaron, Dan Bystritsky, Ben-Etzion Yaron

Main category: cs.AI

TL;DR: A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within ±5 pp on Q1-Q500) with ≈19× lower cloud cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost.

Motivation: To develop cost-efficient small language models that can match the performance of much larger frontier models like GPT-4o on factual grounding tasks, enabling more accessible and affordable AI deployment.

Method: Combines minimal directed “Exoskeleton Reasoning” scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. The approach synergizes reasoning scaffolds with fine-tuning for significant performance gains.

Result: Humains-Junior (3.8B) matches GPT-4o on the FACTS Grounding subset: GPT-4o 73.5% vs Humains-Junior 72.7% with a paired difference of 0.8 pp. TOST establishes equivalence at the ±5 pp margin. The combined method provides a +17.7 pp improvement (p < 0.001) and reduces variance by ≈25%.
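
The equivalence claim can be illustrated with a confidence-interval-inside-margin check on paired per-item differences, the standard reading of TOST. The per-item scores below are simulated, not the paper's data.

```python
# Sketch of an equivalence check on paired accuracy differences (simulated
# data). CI-inside-margin with a 95% CI is a conservative TOST analogue.
import numpy as np

rng = np.random.default_rng(0)
a = rng.binomial(1, 0.735, 500)   # stand-in per-item correctness, model A
b = rng.binomial(1, 0.727, 500)   # stand-in per-item correctness, model B

diffs = a - b
boot = np.array([rng.choice(diffs, diffs.size, replace=True).mean()
                 for _ in range(10_000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
margin = 0.05                      # +/- 5 pp equivalence margin
print(f"95% bootstrap CI for the difference: [{lo:.3f}, {hi:.3f}]")
print("equivalent at +/-5 pp:", -margin < lo and hi < margin)
```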

Conclusion: Small language models can achieve frontier model performance on factual grounding tasks through strategic combination of directed reasoning scaffolds and behavioral fine-tuning, offering dramatically lower cost (≈19× cheaper) and deployment flexibility.

Abstract: We introduce Humains-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Results. On Q1–Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5–77.2) and Humains-Junior 72.7% (95% CI 68.7–76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen’s $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humains-Junior’s base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed “Exoskeleton Reasoning” scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1–Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section 5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1–Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1–Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI

[249] Estimating cognitive biases with attention-aware inverse planning

Sounak Banerjee, Daphne Cornelisse, Deepak Gopinath, Emily Sumner, Jonathan DeCastro, Guy Rosman, Eugene Vinitsky, Mark K. Ho

Main category: cs.AI

TL;DR: The paper introduces attention-aware inverse planning to estimate people’s cognitive biases from their actions, combining deep reinforcement learning with cognitive modeling to infer attentional strategies in real-life driving scenarios.

Motivation: Autonomous systems need to understand human cognitive biases that influence goal-directed behaviors, particularly in tasks like driving where attention biases systematically affect performance.

Method: Combines deep reinforcement learning with computational cognitive modeling to solve the attention-aware inverse planning problem, which estimates attentional biases from observed actions.
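
At its core, inverse planning scores candidate bias parameters by how well the induced policy explains observed behavior. A minimal sketch under a Boltzmann-rational action model, with per-bias Q-functions (which would come from the deep-RL stage) left as placeholders:

```python
# Sketch: pick the attention-bias parameter whose policy best explains the
# observed state-action trajectory. q_by_bias maps bias -> Q-function.
import numpy as np

def log_likelihood(states, actions, q_fn, beta=2.0):
    """Sum of log softmax(beta * Q(s, .)) at the chosen actions."""
    ll = 0.0
    for s, a in zip(states, actions):
        logits = beta * q_fn(s)                        # action values under this bias
        ll += logits[a] - np.logaddexp.reduce(logits)  # stable log-softmax
    return ll

def infer_attention_bias(states, actions, q_by_bias: dict):
    return max(q_by_bias, key=lambda b: log_likelihood(states, actions, q_by_bias[b]))
```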

Result: Demonstrated the approach can infer attentional strategies of RL agents in real-life driving scenarios from the Waymo Open Dataset, showing scalability of cognitive bias estimation.

Conclusion: Attention-aware inverse planning provides a scalable method to estimate cognitive biases from behavior, systematically differing from standard inverse reinforcement learning approaches.

Abstract: People’s goal-directed behaviors are influenced by their cognitive biases, and autonomous systems that interact with people should be aware of this. For example, people’s attention to objects in their environment will be biased in a way that systematically affects how they perform everyday tasks such as driving to work. Here, building on recent work in computational cognitive science, we formally articulate the attention-aware inverse planning problem, in which the goal is to estimate a person’s attentional biases from their actions. We demonstrate how attention-aware inverse planning systematically differs from standard inverse reinforcement learning and how cognitive biases can be inferred from behavior. Finally, we present an approach to attention-aware inverse planning that combines deep reinforcement learning with computational cognitive modeling. We use this approach to infer the attentional strategies of RL agents in real-life driving scenarios selected from the Waymo Open Dataset, demonstrating the scalability of estimating cognitive biases with attention-aware inverse planning.

[250] From Queries to Insights: Agentic LLM Pipelines for Spatio-Temporal Text-to-SQL

Manu Redd, Tao Zhe, Dongjie Wang

Main category: cs.AI

TL;DR: An agentic pipeline improves NL-to-SQL systems for spatio-temporal queries by using a ReAct agent to plan, decompose, and adapt queries through schema inspection, SQL generation, execution, and visualization tools.

Motivation: Existing NL-to-SQL systems struggle with realistic spatio-temporal queries that require aligning vague user phrasing with schema-specific categories, handling temporal reasoning, and choosing appropriate outputs.

Method: Extends a naive text-to-SQL baseline with orchestration by a Mistral-based ReAct agent that can plan, decompose, and adapt queries through schema inspection, SQL generation, execution, and visualization tools.
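
A ReAct orchestrator of this kind is essentially a loop over thought, tool call, and observation. The control tokens, tool interface, and model call below are schematic, not the paper's implementation.

```python
# Minimal ReAct-style orchestration loop (schematic). `llm` returns text;
# `tools` might hold inspect_schema, run_sql, and plot callables.
def react_agent(llm, tools: dict, question: str, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")       # model plans the next move
        transcript += f"Thought:{step}\n"
        if "FINAL:" in step:                      # agent declares an answer
            return step.split("FINAL:", 1)[1].strip()
        if "ACTION:" in step:                     # e.g. "ACTION: run_sql | SELECT ..."
            name, _, arg = step.split("ACTION:", 1)[1].partition("|")
            observation = tools[name.strip()](arg.strip())
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```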

Result: Achieves substantially higher accuracy than the naive baseline (91.4% vs. 28.6%) on 35 natural-language queries over the NYC and Tokyo check-in dataset, and enhances usability through maps, plots, and structured natural-language summaries.

Conclusion: Agentic orchestration, rather than stronger SQL generators alone, is a promising foundation for interactive geospatial assistants that enable more natural human-database interaction for users without SQL expertise.

Abstract: Natural-language-to-SQL (NL-to-SQL) systems hold promise for democratizing access to structured data, allowing users to query databases without learning SQL. Yet existing systems struggle with realistic spatio-temporal queries, where success requires aligning vague user phrasing with schema-specific categories, handling temporal reasoning, and choosing appropriate outputs. We present an agentic pipeline that extends a naive text-to-SQL baseline (llama-3-sqlcoder-8b) with orchestration by a Mistral-based ReAct agent. The agent can plan, decompose, and adapt queries through schema inspection, SQL generation, execution, and visualization tools. We evaluate on 35 natural-language queries over the NYC and Tokyo check-in dataset, covering spatial, temporal, and multi-dataset reasoning. The agent achieves substantially higher accuracy than the naive baseline (91.4% vs. 28.6%) and enhances usability through maps, plots, and structured natural-language summaries. Crucially, our design enables more natural human-database interaction, supporting users who lack SQL expertise, detailed schema knowledge, or prompting skill. We conclude that agentic orchestration, rather than stronger SQL generators alone, is a promising foundation for interactive geospatial assistants.

[251] Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling

Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl

Main category: cs.AI

TL;DR: An AI-powered Home Energy Management System uses LLMs as autonomous coordinators to translate natural language preferences into multi-appliance scheduling, achieving optimal energy cost savings without requiring technical expertise from users.

Motivation: To overcome user interaction barriers in HEMS adoption by eliminating the need for users to translate everyday preferences into technical parameters, enabling broader residential demand response participation.

Method: Hierarchical agentic architecture with one orchestrator and three specialist agents using ReAct pattern for iterative reasoning, integrating Google Calendar for context-aware deadline extraction, and evaluating across multiple LLMs with real electricity price data.
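
The cost-optimal benchmark the agents are scored against can be illustrated with a toy version of the underlying scheduling problem: place an appliance's contiguous run in the cheapest window before its deadline, given day-ahead hourly prices. The real MILP benchmark also handles power limits and appliance interactions; the prices below are invented.

```python
# Toy price-aware scheduling: cheapest contiguous start hour before a deadline.
def cheapest_start(prices: list[float], run_hours: int, deadline_hour: int) -> int:
    """Start hour minimizing total price over the run window (ends by deadline)."""
    candidates = range(0, deadline_hour - run_hours + 1)
    return min(candidates, key=lambda s: sum(prices[s : s + run_hours]))

prices = [0.30, 0.28, 0.22, 0.12, 0.10, 0.11, 0.18, 0.25]  # illustrative EUR/kWh
print(cheapest_start(prices, run_hours=2, deadline_hour=8))  # -> 4 (hours 4-5)
```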

Result: Llama-3.3-70B successfully coordinated all appliances across all scenarios to match cost-optimal benchmarks, while other models achieved perfect single-appliance performance but struggled with simultaneous multi-appliance coordination.

Conclusion: LLMs can effectively serve as autonomous coordinators for HEMS, with larger models demonstrating superior multi-appliance coordination capabilities, though analytical query handling without explicit guidance remains challenging.

Abstract: The electricity sector transition requires substantial increases in residential demand response capacity, yet Home Energy Management Systems (HEMS) adoption remains limited by user interaction barriers requiring translation of everyday preferences into technical parameters. While large language models have been applied to energy systems as code generators and parameter extractors, no existing implementation deploys LLMs as autonomous coordinators managing the complete workflow from natural language input to multi-appliance scheduling. This paper presents an agentic AI HEMS where LLMs autonomously coordinate multi-appliance scheduling from natural language requests to device control, achieving optimal scheduling without example demonstrations. A hierarchical architecture combining one orchestrator with three specialist agents uses the ReAct pattern for iterative reasoning, enabling dynamic coordination without hardcoded workflows while integrating Google Calendar for context-aware deadline extraction. Evaluation across three open-source models using real Austrian day-ahead electricity prices reveals substantial capability differences. Llama-3.3-70B successfully coordinates all appliances across all scenarios to match cost-optimal benchmarks computed via mixed-integer linear programming, while other models achieve perfect single-appliance performance but struggle to coordinate all appliances simultaneously. Progressive prompt engineering experiments demonstrate that analytical query handling without explicit guidance remains unreliable despite models’ general reasoning capabilities. We open-source the complete system including orchestration logic, agent prompts, tools, and web interfaces to enable reproducibility, extension, and future research.

[252] AutoSurvey2: Empowering Researchers with Next Level Automated Literature Surveys

Siyi Wu, Chiaxin Liang, Ziqian Bi, Leyi Zhao, Tianyang Wang, Junhao Song, Yichao Zhang, Keyu Chen, Xinyuan Song

Main category: cs.AI

TL;DR: autosurvey2 is an automated multi-stage pipeline for generating comprehensive academic surveys using retrieval-augmented synthesis and structured evaluation, outperforming existing baselines in quality and relevance.

Motivation: The rapid growth of research literature, especially in LLMs, makes producing current and comprehensive survey papers increasingly difficult and time-consuming.

Method: Multi-stage pipeline with parallel section generation, iterative refinement, real-time retrieval of recent publications, and multi-LLM evaluation framework for quality assessment.

Result: autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence, topical relevance, and citation fidelity.

Conclusion: autosurvey2 provides a scalable and reproducible solution for automated scholarly writing and establishes a foundation for future research in this area.

Abstract: The rapid growth of research literature, particularly in large language models (LLMs), has made producing comprehensive and current survey papers increasingly difficult. This paper introduces autosurvey2, a multi-stage pipeline that automates survey generation through retrieval-augmented synthesis and structured evaluation. The system integrates parallel section generation, iterative refinement, and real-time retrieval of recent publications to ensure both topical completeness and factual accuracy. Quality is assessed using a multi-LLM evaluation framework that measures coverage, structure, and relevance in alignment with expert review standards. Experimental results demonstrate that autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence and topical relevance while maintaining strong citation fidelity. By combining retrieval, reasoning, and automated evaluation into a unified framework, autosurvey2 provides a scalable and reproducible solution for generating long-form academic surveys and contributes a solid foundation for future research on automated scholarly writing. All code and resources are available at https://github.com/annihi1ation/auto_research.

[253] MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, Lequan Yu

Main category: cs.AI

TL;DR: MedAgentBoard is a comprehensive benchmark evaluating multi-agent collaboration vs single-LLM and conventional methods across diverse medical tasks, revealing nuanced performance trade-offs.

Motivation: To address the insufficient understanding of practical advantages of multi-agent collaboration in medical AI, and the lack of generalizable evaluations covering diverse real-world clinical tasks with rigorous comparisons.

Method: Developed MedAgentBoard benchmark covering four medical task categories: medical (visual) question answering, lay summary generation, structured EHR predictive modeling, and clinical workflow automation across text, images, and EHR data.

Result: Multi-agent collaboration shows benefits in specific scenarios (e.g., task completeness in workflow automation) but doesn’t consistently outperform advanced single LLMs (in textual medical QA) or specialized conventional methods (in medical VQA and EHR prediction).

Conclusion: Task-specific, evidence-based approach is necessary for selecting AI solutions in medicine, as multi-agent collaboration’s complexity and overhead must be carefully weighed against performance gains.

Abstract: The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.

[254] Large Language Model-assisted Autonomous Vehicle Recovery from Immobilization

Zhipeng Bao, Qianwen Li

Main category: cs.AI

TL;DR: StuckSolver is an LLM-driven framework that enables autonomous vehicles to resolve immobilization scenarios through self-reasoning and passenger-guided decision-making, operating as a plug-in module without modifying the AV’s core architecture.

Motivation: Current AV recovery solutions like remote intervention and manual takeover are inadequate: they are either costly and inefficient or exclude non-drivers, limiting AV accessibility. AVs often get immobilized in scenarios where human drivers excel, disrupting traffic flow.

Method: StuckSolver is designed as a plug-in add-on module that interfaces with standard sensor data streams to detect immobilization states, interpret environmental context, and generate high-level recovery commands for the AV’s native planner. It uses LLM-driven reasoning for decision-making.

Result: Evaluation on Bench2Drive benchmark and custom uncertainty scenarios shows StuckSolver achieves near-state-of-the-art performance through autonomous self-reasoning alone, with further improvements when passenger guidance is incorporated.

Conclusion: StuckSolver provides an effective solution for AV immobilization recovery that enhances accessibility and performance without requiring modifications to the AV’s core architecture.

Abstract: Despite significant advancements in recent decades, autonomous vehicles (AVs) continue to face challenges in navigating certain traffic scenarios where human drivers excel. In such situations, AVs often become immobilized, disrupting overall traffic flow. Current recovery solutions, such as remote intervention (which is costly and inefficient) and manual takeover (which excludes non-drivers and limits AV accessibility), are inadequate. This paper introduces StuckSolver, a novel Large Language Model (LLM) driven recovery framework that enables AVs to resolve immobilization scenarios through self-reasoning and/or passenger-guided decision-making. StuckSolver is designed as a plug-in add-on module that operates on top of the AV’s existing perception-planning-control stack, requiring no modification to its internal architecture. Instead, it interfaces with standard sensor data streams to detect immobilization states, interpret environmental context, and generate high-level recovery commands that can be executed by the AV’s native planner. We evaluate StuckSolver on the Bench2Drive benchmark and in custom-designed uncertainty scenarios. Results show that StuckSolver achieves near-state-of-the-art performance through autonomous self-reasoning alone and exhibits further improvements when passenger guidance is incorporated.

[255] Can AI be Accountable?

Andrew L. Kun

Main category: cs.AI

TL;DR: The paper argues that AI systems must be accountable to users, voters, and decision makers through mechanisms for information requests, discussion, and sanctions, but current AI often lacks these accountability features.

Motivation: As AI becomes increasingly powerful, it's crucial that it serves human needs and can be held accountable by those affected by its actions, which requires establishing proper accountability mechanisms.

Method: The chapter relates general accountability definitions to AI, illustrates what accountable vs unaccountable AI looks like, and explores approaches to ensure AI accountability.

Result: The analysis shows that current AI systems often lack accountability features like questioning, discussion, and sanctioning capabilities that are essential for proper oversight.

Conclusion: Developing approaches to make all AI accountable to affected parties is essential for ensuring AI serves human interests responsibly as its power continues to grow.

Abstract: The AI we use is powerful, and its power is increasing rapidly. If this powerful AI is to serve the needs of consumers, voters, and decision makers, then it is imperative that the AI is accountable. In general, an agent is accountable to a forum if the forum can request information from the agent about its actions, if the forum and the agent can discuss this information, and if the forum can sanction the agent. Unfortunately, in too many cases today’s AI is not accountable – we cannot question it, enter into a discussion with it, let alone sanction it. In this chapter we relate the general definition of accountability to AI, we illustrate what it means for AI to be accountable and unaccountable, and we explore approaches that can improve our chances of living in a world where all AI is accountable to those who are affected by it.

[256] Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4

Yuxin Li, Minghao Liu, Ruida Wang, Wenzhao Ji, Zhitao He, Rui Pan, Junming Huang, Tong Zhang, Yi R. Fung

Main category: cs.AI

TL;DR: Lean4PHYS is a formal reasoning framework for college physics problems in Lean4, featuring LeanPhysBench (200 physics problems) and PhysLib (foundational physics theorems). Baseline results show low performance (16-35%), demonstrating the benchmark’s difficulty and PhysLib’s effectiveness (+11.75% improvement).

Motivation: To establish the first formal physics reasoning benchmark in Lean4 for college-level physics problems, addressing the gap in formal verification tools for physics education and research.

Method: Created LeanPhysBench with 200 hand-crafted physics problems from textbooks and competitions, developed PhysLib as a foundational physics theorem repository, and evaluated using expert Lean4 provers and state-of-the-art AI models.
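
For flavor, a toy Lean4 statement in the spirit of such benchmark items; it is invented here (not drawn from LeanPhysBench) and assumes Mathlib's real-number lemmas.

```lean
-- Invented toy example in the spirit of a formal physics statement
-- (not from LeanPhysBench); assumes Mathlib.
import Mathlib

-- Average velocity: if v = d / t with t ≠ 0, the distance covered is v * t.
theorem avg_velocity (d t v : ℝ) (ht : t ≠ 0) (hv : v = d / t) :
    v * t = d := by
  rw [hv]
  exact div_mul_cancel₀ d ht
```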

Result: Best performance was 16% (DeepSeek-Prover-V2-7B) and 35% (Claude-Sonnet-4), showing the benchmark’s difficulty. PhysLib improved model performance by average 11.75%.

Conclusion: Lean4PHYS provides a challenging benchmark for formal physics reasoning, demonstrating both the difficulty of physics formalization and the effectiveness of foundational physics libraries in improving automated reasoning capabilities.

Abstract: We present Lean4PHYS, a comprehensive reasoning framework for college-level physics problems in Lean4. Lean4PHYS includes LeanPhysBench, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce PhysLib, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in Lean4PHYS, we report baseline results using leading expert Lean4 math provers and state-of-the-art closed-source models, with the best expert prover, DeepSeek-Prover-V2-7B, achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our PhysLib can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our LeanPhysBench and the effectiveness of PhysLib. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.

[257] GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks

Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li

Main category: cs.AI

TL;DR: This paper identifies three key dimensions of GUI knowledge that current VLMs lack, introduces a benchmark to assess this knowledge, and shows that GUI knowledge is crucial for successful GUI task automation.

Motivation: Large vision language models lag behind humans in GUI task automation due to missing core GUI knowledge that existing training schemes cannot fully address.

Method: Analyzed common failure patterns in GUI task execution and distilled GUI knowledge into three dimensions: interface perception, interaction prediction, and instruction understanding. Created GUI Knowledge Bench benchmark with multiple choice and yes/no questions across six platforms and 292 applications.

Result: Current VLMs can identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments show a close link between GUI knowledge and task success.

Conclusion: The structured framework for assessing GUI knowledge helps select VLMs with greater potential before downstream training and provides insights for building more capable GUI agents.

Abstract: Large vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine-tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface perception, knowledge about recognizing widgets and system states; (2) interaction prediction, knowledge about reasoning over action-state transitions; and (3) instruction understanding, knowledge about planning, verifying, and assessing task completion progress. We further introduce GUI Knowledge Bench, a benchmark with multiple choice and yes/no questions across six platforms (Web, Android, macOS, Windows, Linux, iOS) and 292 applications. Our evaluation shows that current VLMs identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments on real-world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and provides insights for building more capable GUI agents.

[258] Beyond Benchmarks: The Economics of AI Inference

Boqin Zhuang, Jiacheng Qiao, Mingqian Liu, Mingxing Yu, Ping Hong, Rui Li, Xiaoxia Song, Xiangjun Xu, Xu Chen, Yaoyao Ma, Yujie Gao

Main category: cs.AI

TL;DR: This paper introduces an economic framework for LLM inference costs, analyzing marginal costs, economies of scale, and output quality, revealing principles of diminishing marginal cost and returns to scale through empirical data.

Motivation: The high inference cost of Large Language Models has become a critical factor affecting their commercial viability and widespread adoption, necessitating an economic analysis of inference processes.

Method: Developed a quantitative ’economics of inference’ framework treating LLM inference as compute-driven production, analyzed using empirical data from WiNEval-3.0 to construct the ‘LLM Inference Production Frontier’.

Result: Revealed three key principles: diminishing marginal cost, diminishing returns to scale, and the existence of an optimal cost-effectiveness zone for LLM inference.

Conclusion: Provides an economic basis for model deployment decisions and lays empirical foundation for market-based pricing and optimization of AI inference resources.

Abstract: The inference cost of Large Language Models (LLMs) has become a critical factor in determining their commercial viability and widespread adoption. This paper introduces a quantitative “economics of inference” framework, treating the LLM inference process as a compute-driven intelligent production activity. We analyze its marginal cost, economies of scale, and quality of output under various performance configurations. Based on empirical data from WiNEval-3.0, we construct the first “LLM Inference Production Frontier,” revealing three principles: diminishing marginal cost, diminishing returns to scale, and an optimal cost-effectiveness zone. This paper not only provides an economic basis for model deployment decisions but also lays an empirical foundation for the future market-based pricing and optimization of AI inference resources.

[259] Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math

Bo Pang, Deqian Kong, Silvio Savarese, Caiming Xiong, Yingbo Zhou

Main category: cs.AI

TL;DR: Reasoning Curriculum is a two-stage RL approach that first develops reasoning skills in math domains, then transfers them to other domains through joint reinforcement learning.

DetailsMotivation: Most RL efforts for LLMs focus only on math and code, but reasoning skills should be transferable across domains. The goal is to develop general reasoning capabilities beyond specialized domains.

Method: Two-stage curriculum: Stage 1 performs math-only RL with verifiable rewards to build reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and consolidate these skills across domains.

Result: The approach yields consistent gains on Qwen3-4B and Llama-3.1-8B across multi-domain evaluations. Both stages are necessary, and math-first elicitation increases cognitive behaviors important for complex problem-solving.

Conclusion: Reasoning Curriculum provides a compact, easy-to-adopt recipe for developing general reasoning capabilities in LLMs through domain transfer from math to broader domains.

Abstract: Reinforcement learning (RL) can elicit strong reasoning in large language models (LLMs), yet most open efforts focus on math and code. We propose Reasoning Curriculum, a simple two-stage curriculum that first elicits reasoning skills in pretraining-aligned domains such as math, then adapts and refines these skills across other domains via joint RL. Stage 1 performs a brief cold start and then math-only RL with verifiable rewards to develop reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and consolidate these skills. The curriculum is minimal and backbone-agnostic, requiring no specialized reward models beyond standard verifiability checks. Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, Reasoning Curriculum yields consistent gains. Ablations and a cognitive-skill analysis indicate that both stages are necessary and that math-first elicitation increases cognitive behaviors important for solving complex problems. Reasoning Curriculum provides a compact, easy-to-adopt recipe for general reasoning.

[260] The FM Agent

Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, Shenghao Feng, Mo Zhao, Fengzhi Qiu, Rui Yang, Mengmeng Zhang, Wenyi Zhu, Yingying Sun, Quan Sun, Shunhao Yan, Danyu Liu, Dawei Yin, Dou Shen

Main category: cs.AI

TL;DR: FM Agent is a multi-agent framework combining LLM reasoning and evolutionary search to solve complex real-world problems autonomously, achieving state-of-the-art results across multiple domains.

DetailsMotivation: To develop autonomous AI research agents that can address complex scientific and engineering challenges without human intervention, accelerating innovation and discovery.

Method: Combines LLM-based reasoning with large-scale evolutionary search, featuring cold-start initialization, evolutionary sampling strategy, domain-specific evaluators, and distributed execution infrastructure.

Result: Achieved SOTA results: 1976.3 on ALE-Bench (+5.2%), 43.56% on MLE-Bench (+4.0pp), up to 20x speedups on KernelBench, and new SOTA on classical mathematical problems.

Conclusion: FM Agent demonstrates broad applicability for enterprise R&D and scientific research, capable of automating complex discovery processes and delivering substantial engineering and scientific advances.

Abstract: Large language models (LLMs) are catalyzing the development of autonomous AI research agents for scientific and engineering discovery. We present FM Agent, a novel and general-purpose multi-agent framework that leverages a synergistic combination of LLM-based reasoning and large-scale evolutionary search to address complex real-world challenges. The core of FM Agent integrates several key innovations: 1) a cold-start initialization phase incorporating expert guidance, 2) a novel evolutionary sampling strategy for iterative optimization, 3) domain-specific evaluators that combine correctness, effectiveness, and LLM-supervised feedback, and 4) a distributed, asynchronous execution infrastructure built on Ray. Demonstrating broad applicability, our system has been evaluated across diverse domains, including operations research, machine learning, GPU kernel optimization, and classical mathematical problems. FM Agent reaches state-of-the-art results autonomously, without human intervention or tuning – 1976.3 on ALE-Bench (+5.2%), 43.56% on MLE-Bench (+4.0pp), up to 20x speedups on KernelBench, and establishes new state-of-the-art (SOTA) results on several classical mathematical problems. Beyond academic benchmarks, FM Agent shows considerable promise for both large-scale enterprise R&D workflows and fundamental scientific research, where it can accelerate innovation, automate complex discovery processes, and deliver substantial engineering and scientific advances with broader societal impact.

[261] One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning

Renhao Li, Jianhong Tu, Yang Su, Hamid Alinejad-Rokny, Derek F. Wong, Junyang Lin, Min Yang

Main category: cs.AI

TL;DR: ToolRM is a family of lightweight generative reward models for tool-use scenarios, trained on a novel pairwise preference dataset (ToolPref-Pairwise-30K) and achieving superior performance over frontier models in tool-use alignment.

DetailsMotivation: The lack of reward models specifically designed for function-calling tasks in tool learning has limited progress toward more capable agentic AI, creating a need for specialized RMs for tool-use scenarios.

Method: Proposed a novel pipeline using rule-based scoring and multidimensional sampling to construct pairwise preference data, creating the ToolPref-Pairwise-30K dataset. Built the TRBench_BFCL benchmark for evaluation. Trained models on the Qwen3-4B/8B series.

Result: ToolRM models achieve up to 14.28% higher accuracy than frontier models like Claude 4 and OpenAI o3 in pairwise reward judgments. Enables inference-time scaling and reduces output token usage by over 66% on ACEBench.

Conclusion: ToolRM effectively addresses the gap in tool-use reward modeling, generalizes to broader critique tasks, and demonstrates significant performance improvements while being more efficient than existing approaches.

Abstract: Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench_BFCL, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
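
To give a flavor of rule-based scoring for pairwise preference construction, here is a minimal sketch; the tool schema and the four scoring rules below are invented for the example and are not the paper's actual pipeline:

```python
# Score candidate tool calls by simple verifiable rules, then pair a
# higher-scored call ("chosen") with a lower-scored one ("rejected").
import json

# Hypothetical schema for a single tool.
TOOL_SCHEMA = {"get_weather": {"required": {"city"}, "optional": {"unit"}}}

def score_tool_call(raw: str) -> int:
    """Rule-based score: parseable JSON, known tool, valid arguments."""
    score = 0
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return score
    score += 1                                      # rule 1: parseable
    spec = TOOL_SCHEMA.get(call.get("name"))
    if spec is None:
        return score
    score += 1                                      # rule 2: known tool
    args = set(call.get("arguments", {}))
    if spec["required"] <= args:
        score += 1                                  # rule 3: required args present
    if args <= spec["required"] | spec["optional"]:
        score += 1                                  # rule 4: no hallucinated args
    return score

if __name__ == "__main__":
    a = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
    b = '{"name": "get_weather", "arguments": {"town": "Paris"}}'
    chosen, rejected = sorted([a, b], key=score_tool_call, reverse=True)
    print("chosen:", chosen)  # the well-formed call wins the pair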

[262] Questionnaire meets LLM: A Benchmark and Empirical Study of Structural Skills for Understanding Questions and Responses

Duc-Hai Nguyen, Vijayakumar Nanjappan, Barry O’Sullivan, Hoang D. Nguyen

Main category: cs.AI

TL;DR: QASU is a benchmark for evaluating LLMs’ ability to process questionnaire data, showing that proper serialization formats and prompting strategies can significantly improve performance on survey analysis tasks.

DetailsMotivation: Current survey analysis tools are human-centric and don't integrate well with LLMs, leaving a gap in evidence-based guidance for representing questionnaires for AI consumption despite the prevalence of survey data.

Method: Introduces QASU benchmark that tests six structural skills across six serialization formats and multiple prompt strategies, systematically isolating format and prompting effects.

Result: Choosing effective format and prompt combinations improved accuracy by up to 8.8 percentage points compared to suboptimal formats. Adding structural hints through self-augmented prompting yielded additional improvements of 3-4 percentage points on average.

Conclusion: QASU provides a versatile foundation for advancing LLM-based questionnaire analysis research and practice by demonstrating the importance of proper data representation and prompting strategies.

Abstract: Millions of people take surveys every day, from market polls and academic studies to medical questionnaires and customer feedback forms. These datasets capture valuable insights, but their scale and structure present a unique challenge for large language models (LLMs), which otherwise excel at few-shot reasoning over open-ended text. Yet, their ability to process questionnaire data or lists of questions crossed with hundreds of respondent rows remains underexplored. Current retrieval and survey analysis tools (e.g., Qualtrics, SPSS, REDCap) are typically designed for humans in the workflow, limiting such data integration with LLM and AI-empowered automation. This gap leaves scientists, surveyors, and everyday users without evidence-based guidance on how to best represent questionnaires for LLM consumption. We address this by introducing QASU (Questionnaire Analysis and Structural Understanding), a benchmark that probes six structural skills, including answer lookup, respondent count, and multi-hop inference, across six serialization formats and multiple prompt strategies. Experiments on contemporary LLMs show that choosing an effective format and prompt combination can improve accuracy by up to 8.8 percentage points compared to suboptimal formats. For specific tasks, carefully adding a lightweight structural hint through self-augmented prompting can yield further improvements of 3-4 percentage points on average. By systematically isolating format and prompting effects, our open-source benchmark offers a simple yet versatile foundation for advancing both research and real-world practice in LLM-based questionnaire analysis.
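
As a minimal sketch of what two serialization choices look like in practice, with a self-augmented structural hint prepended to the prompt (the format pair and hint wording below are illustrative, not QASU's exact six formats):

```python
# Serialize the same respondent rows two ways for an LLM prompt.
import json

ROWS = [{"respondent": "R1", "Q1": "yes", "Q2": 4},
        {"respondent": "R2", "Q1": "no",  "Q2": 2}]

def to_json_lines(rows):
    return "\n".join(json.dumps(r) for r in rows)

def to_markdown_table(rows):
    headers = list(rows[0])
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(r[h]) for h in headers) + " |" for r in rows]
    return "\n".join(lines)

if __name__ == "__main__":
    # A lightweight structural hint (self-augmented prompting) can preface the data.
    hint = "Each row is one respondent; columns Q1-Q2 are survey answers."
    print(hint + "\n\n" + to_markdown_table(ROWS))
    print("\n--- or ---\n\n" + hint + "\n\n" + to_json_lines(ROWS))
```

The benchmark's finding is that such format and hint choices, though semantically equivalent, measurably shift downstream accuracy.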

[263] Retrieval Augmented Generation-Enhanced Distributed LLM Agents for Generalizable Traffic Signal Control with Emergency Vehicles

Xinhang Li, Qing Guo, Junyu Chen, Zheng Guo, Shengzhe Xu, Lei Li, Lin Zhang

Main category: cs.AI

TL;DR: REG-TSC uses RAG-enhanced distributed LLM agents with emergency response for generalizable traffic signal control, achieving significant improvements in travel time, queue length, and emergency vehicle waiting time.

DetailsMotivation: Address LLM hallucinations in emergencies and challenges with diverse intersection types for traffic state encoding and cross-intersection training in Traffic Signal Control.

Method: Emergency-aware reasoning framework with dynamic reasoning depth and Reviewer-based Emergency RAG (RERAG), plus type-agnostic traffic representation with Reward-guided Reinforced Refinement (R3) for heterogeneous intersections.

Result: On three real-world road networks with 17 to 177 heterogeneous intersections, REG-TSC reduces travel time by 42.00%, queue length by 62.31%, and emergency vehicle waiting time by 83.16%, outperforming state-of-the-art methods.

Conclusion: REG-TSC effectively addresses LLM hallucinations in emergencies and generalizes well across heterogeneous intersections, demonstrating superior performance in traffic signal control.

Abstract: With increasing urban traffic complexity, Traffic Signal Control (TSC) is essential for optimizing traffic flow and improving road safety. Large Language Models (LLMs) emerge as promising approaches for TSC. However, they are prone to hallucinations in emergencies, leading to unreliable decisions that may cause substantial delays for emergency vehicles. Moreover, diverse intersection types present substantial challenges for traffic state encoding and cross-intersection training, limiting generalization across heterogeneous intersections. Therefore, this paper proposes Retrieval Augmented Generation (RAG)-enhanced distributed LLM agents with Emergency response for Generalizable TSC (REG-TSC). Firstly, this paper presents an emergency-aware reasoning framework, which dynamically adjusts reasoning depth based on the emergency scenario and is equipped with a novel Reviewer-based Emergency RAG (RERAG) to distill specific knowledge and guidance from historical cases, enhancing the reliability and rationality of agents’ emergency decisions. Secondly, this paper designs a type-agnostic traffic representation and proposes a Reward-guided Reinforced Refinement (R3) for heterogeneous intersections. R3 adaptively samples training experience from diverse intersections with environment feedback-based priority and fine-tunes LLM agents with a designed reward-weighted likelihood loss, guiding REG-TSC toward high-reward policies across heterogeneous intersections. On three real-world road networks with 17 to 177 heterogeneous intersections, extensive experiments show that REG-TSC reduces travel time by 42.00%, queue length by 62.31%, and emergency vehicle waiting time by 83.16%, outperforming other state-of-the-art methods.

[264] Graph-Enhanced Policy Optimization in LLM Agent Training

Jiazhen Yuan, Wei Zhao, Zhengbiao Bai

Main category: cs.AI

TL;DR: GEPO addresses structural blindness in group-based RL for LLM agents by dynamically building state-transition graphs and using graph centrality to improve exploration, credit assignment, and planning.

DetailsMotivation: Group-based RL methods for multi-turn interactive LLM agents suffer from structural blindness - inability to exploit environmental connectivity, leading to inefficient exploration, imprecise credit assignment, and myopic planning.

Method: Graph-Enhanced Policy Optimization (GEPO) dynamically constructs state-transition graphs from agent experience and uses graph-theoretic centrality to provide: structured intrinsic rewards for guided exploration, graph-enhanced advantage function for topology-aware credit assignment, and dynamic discount factors adapted to state strategic value.

Result: On ALFWorld, WebShop, and proprietary Workbench benchmarks, GEPO achieved absolute success rate gains of +4.1%, +5.3%, and +10.9% over competitive baselines.

Conclusion: Explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.

Abstract: Group-based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet, when applied to train multi-turn, interactive LLM agents, these methods often suffer from structural blindness: the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration, (2) imprecise credit assignment due to overlooking pivotal states, and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1) structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit assignment, and (3) a dynamic discount factor adapted to each state’s strategic value. On the ALFWorld, WebShop, and proprietary Workbench benchmarks, GEPO demonstrates strong performance, achieving absolute success rate gains of +4.1%, +5.3%, and +10.9% over competitive baselines. These results highlight that explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.
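
A toy sketch of the core mechanism, assuming networkx is available: build a state-transition graph from observed trajectories and convert betweenness centrality into an intrinsic reward that highlights pivotal states. This is an illustration of the idea, not the authors' implementation, and the trajectories are invented:

```python
# Centrality-based intrinsic reward over an agent's state-transition graph.
import networkx as nx

def build_transition_graph(trajectories):
    """trajectories: iterable of state-id sequences observed by the agent."""
    g = nx.DiGraph()
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            g.add_edge(s, s_next)
    return g

def intrinsic_rewards(g, scale=0.1):
    """Reward states that lie on many shortest paths, i.e., pivotal states."""
    centrality = nx.betweenness_centrality(g)
    return {state: scale * c for state, c in centrality.items()}

if __name__ == "__main__":
    trajs = [["start", "door", "key", "goal"],
             ["start", "wall", "start", "door", "goal"]]
    g = build_transition_graph(trajs)
    # "door" gates most routes to the goal, so it earns the largest bonus.
    print(intrinsic_rewards(g))
```

The same centrality scores could, in spirit, also reweight advantages or discounting, which is how GEPO derives its other two signals.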

[265] GraphCompliance: Aligning Policy and Context Graphs for LLM-Based Regulatory Compliance

Jiseong Chung, Ronny Ko, Wonchul Yoo, Makoto Onizuka, Sungmok Kim, Tae-Wan Kim, Won-Yong Shin

Main category: cs.AI

TL;DR: GraphCompliance is a framework that aligns structured regulatory policy graphs with runtime context graphs to improve compliance assessment, achieving 4.1-7.2 pp higher F1 scores than LLM-only and RAG baselines.

DetailsMotivation: Web-scale compliance requires regulatory assessment for each request, but regulatory texts are cross-referential and normative while runtime contexts are unstructured natural language, creating challenges for alignment and interpretation.

Method: Represent regulatory texts as Policy Graphs (encoding normative structure and cross-references) and runtime contexts as Context Graphs (formalizing events as SAO and entity-relation triples), then align them to anchor LLM reasoning in structured information.

Result: On 300 GDPR-derived real-world scenarios across five tasks, GraphCompliance achieved 4.1-7.2 percentage points higher micro-F1 than LLM-only and RAG baselines, with fewer under- and over-predictions, resulting in higher recall and lower false positive rates.

Conclusion: Structured representations and judge LLMs are complementary for normative reasoning, with each graph component contributing to improved performance in regulatory compliance assessment.

Abstract: Compliance at web scale poses practical challenges: each request may require a regulatory assessment. Regulatory texts (e.g., the General Data Protection Regulation, GDPR) are cross-referential and normative, while runtime contexts are expressed in unstructured natural language. This setting motivates us to align semantic information in unstructured text with the structured, normative elements of regulations. To this end, we introduce GraphCompliance, a framework that represents regulatory texts as a Policy Graph and runtime contexts as a Context Graph, and aligns them. In this formulation, the policy graph encodes normative structure and cross-references, whereas the context graph formalizes events as subject-action-object (SAO) and entity-relation triples. This alignment anchors the reasoning of a judge large language model (LLM) in structured information and helps reduce the burden of regulatory interpretation and event parsing, enabling a focus on the core reasoning step. In experiments on 300 GDPR-derived real-world scenarios spanning five evaluation tasks, GraphCompliance yields 4.1-7.2 percentage points (pp) higher micro-F1 than LLM-only and RAG baselines, with fewer under- and over-predictions, resulting in higher recall and lower false positive rates. Ablation studies indicate contributions from each graph component, suggesting that structured representations and a judge LLM are complementary for normative reasoning.
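
To illustrate the Context Graph side of the framework, here is a small sketch that records SAO and entity-relation triples as labeled edges; the scenario, entity names, and relation labels are hypothetical, and the paper's actual graph schema may differ:

```python
# Build a toy Context Graph of subject-action-object events for a judge LLM.
import networkx as nx

def add_sao_event(g: nx.MultiDiGraph, subject: str, action: str, obj: str):
    g.add_edge(subject, obj, relation=action)

if __name__ == "__main__":
    ctx = nx.MultiDiGraph()
    # Hypothetical GDPR-style scenario: a company shares user data abroad.
    add_sao_event(ctx, "AcmeCorp", "collects", "email_address")
    add_sao_event(ctx, "AcmeCorp", "transfers_to", "ThirdPartyInc")
    add_sao_event(ctx, "ThirdPartyInc", "stores_in", "non_EU_region")
    # Entity-relation triple: a role under the regulation.
    ctx.add_edge("AcmeCorp", "email_address", relation="data_controller_of")
    for u, v, d in ctx.edges(data=True):
        print(f"({u}) -[{d['relation']}]-> ({v})")
```

Aligning such a graph against a Policy Graph of normative clauses is what lets the judge LLM skip raw event parsing and focus on the compliance reasoning itself.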

[266] Discovering State Equivalences in UCT Search Trees By Action Pruning

Robin Schmöcker, Alexander Dockhorn, Bodo Rosenhahn

Main category: cs.AI

TL;DR: The paper addresses the challenge of finding state abstractions in MCTS for noisy or large action space settings by proposing a weaker abstraction condition called IPA-UCT, which trades minor accuracy loss for finding more abstractions and outperforms existing methods.

DetailsMotivation: To enhance MCTS sample efficiency through state abstractions, but existing methods struggle to find state abstractions in noisy or large action space settings due to constraining conditions.

Method: Proposes Ideal Pruning Abstractions in UCT (IPA-UCT) with a weaker state abstraction condition that allows more abstractions by accepting minor accuracy loss. Introduces IPA framework as an alternative to ASAP used by OGA-UCT.

Result: IPA-UCT outperforms OGA-UCT and its derivatives across various test domains and iteration budgets. Both IPA and ASAP are shown to be special cases of a more general p-ASAP framework.

Conclusion: The proposed IPA-UCT method effectively addresses the state abstraction problem in MCTS by using a weaker abstraction condition, enabling better performance than existing approaches while being part of a broader theoretical framework.

Abstract: One approach to enhance Monte Carlo Tree Search (MCTS) is to improve its sample efficiency by grouping/abstracting states or state-action pairs and sharing statistics within a group. Though state-action pair abstractions are mostly easy to find in algorithms such as On the Go Abstractions in Upper Confidence bounds applied to Trees (OGA-UCT), nearly no state abstractions are found in either noisy or large action space settings due to constraining conditions. We provide theoretical and empirical evidence for this claim, and we slightly alleviate this state abstraction problem by proposing a weaker state abstraction condition that trades a minor loss in accuracy for finding many more abstractions. We name this technique Ideal Pruning Abstractions in UCT (IPA-UCT), which outperforms OGA-UCT (and any of its derivatives) across a large range of test domains and iteration budgets as experimentally validated. IPA-UCT uses a different abstraction framework, which we name IPA, in place of the Abstraction of State-Action Pairs (ASAP) framework used by OGA-UCT. Furthermore, we show that both IPA and ASAP are special cases of a more general framework that we call p-ASAP which itself is a special case of the ASASAP framework.

[267] BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li, Bolin Ding, Jingren Zhou

Main category: cs.AI

TL;DR: BOTS is a Bayesian framework for adaptive task selection in LLM reinforcement finetuning that balances exploration-exploitation using explicit and implicit evidence of task difficulty.

DetailsMotivation: Existing task selection methods for RFT suffer from inefficiency, high rollout costs, poor adaptivity, or incomplete evidence, making uniform sampling wasteful.

Method: BOTS maintains Bayesian posterior estimates of task difficulty, incorporates both explicit evaluations and implicit evidence via interpolation, and uses Thompson sampling for task selection.

Result: Across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations.

Conclusion: BOTS provides a practical and extensible solution for dynamic task selection in reinforcement finetuning of LLMs.

Abstract: Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
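
A self-contained sketch of the Thompson-sampling core under simple assumptions: each task's solve rate gets a Beta posterior, and the sampler prefers tasks whose drawn solve rate is closest to 0.5, i.e., neither trivial nor unsolvable. The selection rule, priors, and evidence weights here are illustrative stand-ins for the paper's exact objective:

```python
# Thompson sampling over per-task difficulty posteriors.
import random

class TaskPosterior:
    def __init__(self):
        self.alpha, self.beta = 1.0, 1.0  # uniform Beta(1,1) prior

    def sample_solve_rate(self) -> float:
        return random.betavariate(self.alpha, self.beta)

    def update(self, solved: bool, weight: float = 1.0):
        # weight < 1 could down-weight implicit (interpolated) evidence.
        if solved:
            self.alpha += weight
        else:
            self.beta += weight

def select_task(posteriors: list[TaskPosterior]) -> int:
    # Draw one sample per task; pick the task nearest 50% solve rate.
    draws = [p.sample_solve_rate() for p in posteriors]
    return min(range(len(draws)), key=lambda i: abs(draws[i] - 0.5))

if __name__ == "__main__":
    random.seed(0)
    true_rates = [0.05, 0.5, 0.95]  # hypothetical: hard, medium, easy tasks
    posteriors = [TaskPosterior() for _ in true_rates]
    picks = [0] * len(true_rates)
    for _ in range(2000):
        i = select_task(posteriors)
        posteriors[i].update(random.random() < true_rates[i])
        picks[i] += 1
    print(picks)  # the medium-difficulty task should dominate selections
```

The randomness of the posterior draws supplies exploration; the concentration of the posteriors supplies exploitation, which is the balance the framework relies on.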

[268] AI Mathematician as a Partner in Advancing Mathematical Discovery – A Case Study in Homogenization Theory

Yuanhang Liu, Beichen Wang, Peng Li, Yang Liu

Main category: cs.AI

TL;DR: The paper presents a collaborative human-AI approach where the AI Mathematician system works as a research partner on a challenging homogenization theory problem, combining human intuition with machine computation through iterative problem decomposition and validation.

DetailsMotivation: To address the limited integration of AI into mathematical research practice by demonstrating how AI can operate as a research partner rather than just a problem solver, focusing on enhancing reliability, transparency, and interpretability of mathematical proofs.

Method: Using the AI Mathematician system with targeted human interventions to structure discovery process through iterative decomposition of problems into tractable subgoals, selection of appropriate analytical methods, and validation of intermediate results.

Result: The collaborative approach led to a complete and verifiable proof for a challenging problem in homogenization theory, demonstrating enhanced reliability, transparency, and interpretability while retaining human oversight for formal rigor.

Conclusion: Systematic human-AI co-reasoning can advance mathematical discovery by complementing human intuition with machine computation, creating a collaborative paradigm that enhances proof quality while maintaining human oversight for formal correctness.

Abstract: Artificial intelligence (AI) has demonstrated impressive progress in mathematical reasoning, yet its integration into the practice of mathematical research remains limited. In this study, we investigate how the AI Mathematician (AIM) system can operate as a research partner rather than a mere problem solver. Focusing on a challenging problem in homogenization theory, we analyze the autonomous reasoning trajectories of AIM and incorporate targeted human interventions to structure the discovery process. Through iterative decomposition of the problem into tractable subgoals, selection of appropriate analytical methods, and validation of intermediate results, we reveal how human intuition and machine computation can complement one another. This collaborative paradigm enhances the reliability, transparency, and interpretability of the resulting proofs, while retaining human oversight for formal rigor and correctness. The approach leads to a complete and verifiable proof, and more broadly, demonstrates how systematic human-AI co-reasoning can advance the frontier of mathematical discovery.

[269] Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

Andrew M. Bean, Nabeel Seedat, Shengzhuang Chen, Jonathan Richard Schwarz

Main category: cs.AI

TL;DR: Proposes Scales++, an item-centric approach for selecting small but representative benchmark subsets based on cognitive demands of task items, reducing upfront costs by 18x while maintaining predictive fidelity.

DetailsMotivation: Current model-centric benchmark subset selection methods have high upfront costs, cold-start problems with new benchmarks, and assume future models share failure patterns of predecessors.

Method: Scales++ uses an item-centric approach where data selection is based on intrinsic properties and cognitive demands of benchmark samples rather than model-specific failure patterns.

Result: Achieves 2.9% mean absolute error predicting full benchmark scores using only 0.5% data subset on Open LLM Leaderboard, with 18x reduction in upfront selection cost.

Conclusion: Item-centric approach enables more efficient model evaluation without significant fidelity degradation, provides better cold-start performance and more interpretable benchmarking.

Abstract: The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (‘cold-start’), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we challenge this paradigm and propose an item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.5% data subset, we predict full benchmark scores with a 2.9% mean absolute error. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.
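
One way to picture item-centric selection, assuming a per-item feature vector is available for every benchmark question: greedy farthest-point sampling picks a small subset that covers the feature space. The random features below are stand-ins for the paper's cognitive-demand embeddings, and the coverage heuristic is an illustration rather than the Scales++ algorithm itself:

```python
# Pick a representative benchmark subset by covering item-feature space.
import numpy as np

def farthest_point_subset(features: np.ndarray, k: int) -> list[int]:
    """Greedy coverage: repeatedly add the item farthest from the chosen set."""
    chosen = [0]
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return chosen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    item_features = rng.normal(size=(1000, 16))  # hypothetical item embeddings
    subset = farthest_point_subset(item_features, k=5)  # 0.5% of 1000 items
    print(subset)
```

Because selection depends only on the items, a brand-new benchmark can be subsetted immediately, which is the cold-start advantage the paper claims over model-centric methods.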

[270] A Pragmatic View of AI Personhood

Joel Z. Leibo, Alexander Sasha Vezhnevets, William A. Cunningham, Stanley M. Bileschi

Main category: cs.AI

TL;DR: The paper proposes treating personhood as a flexible bundle of obligations that societies confer to solve governance problems, rather than as a metaphysical property, to pragmatically navigate the diversification of AI personhood.

DetailsMotivation: To address the emerging 'Cambrian explosion' of AI personhood and provide practical solutions for integrating AI agents into society without getting stuck in intractable debates about consciousness or rationality.

Method: Unbundling the traditional personhood bundle into customizable obligations, using decentralized digital identity technology, and analyzing both ‘personhood as a problem’ (exploitative design patterns) and ‘personhood as a solution’ (accountability mechanisms).

Result: A pragmatic framework that enables creation of practical tools like AI contracting mechanisms and accountability systems by treating personhood as context-specific bundles of rights and responsibilities.

Conclusion: By rejecting essentialist definitions of personhood, this approach offers a more flexible and practical way to integrate AI agents into society, focusing on governance solutions rather than metaphysical debates.

Abstract: The emergence of agentic Artificial Intelligence (AI) is set to trigger a “Cambrian explosion” of new kinds of personhood. This paper proposes a pragmatic framework for navigating this diversification by treating personhood not as a metaphysical property to be discovered, but as a flexible bundle of obligations (rights and responsibilities) that societies confer upon entities for a variety of reasons, especially to solve concrete governance problems. We argue that this traditional bundle can be unbundled, creating bespoke solutions for different contexts. This will allow for the creation of practical tools – such as facilitating AI contracting by creating a target “individual” that can be sanctioned – without needing to resolve intractable debates about an AI’s consciousness or rationality. We explore how individuals fit into social roles and discuss the use of decentralized digital identity technology, examining both “personhood as a problem”, where design choices can create “dark patterns” that exploit human social heuristics, and “personhood as a solution”, where conferring a bundle of obligations is necessary to ensure accountability or prevent conflict. By rejecting foundationalist quests for a single, essential definition of personhood, this paper offers a more pragmatic and flexible way to think about integrating AI agents into our society.

[271] Autograder+: A Multi-Faceted AI Framework for Rich Pedagogical Feedback in Programming Education

Vikrant Sahu, Gagan Raj Gupta, Raghav Borikar, Nitin Mane

Main category: cs.AI

TL;DR: Autograder+ transforms traditional autograding into formative learning by using fine-tuned LLMs for automated feedback and visualization of student code patterns.

DetailsMotivation: Traditional autograders provide limited feedback as black-box systems, failing to offer insights into student thinking or learning needs in programming education.

Method: Uses fine-tuned Large Language Model for automated feedback generation and contrastively learned code embeddings for visualizing student submissions into meaningful clusters.

Result: System produced feedback with strong semantic alignment to instructor comments across 600 submissions and enabled grouping solutions into functional clusters.

Conclusion: Autograder+ reduces instructor workload while supporting targeted instruction and promoting stronger learning outcomes through AI-driven feedback and visualization.

Abstract: The rapid growth of programming education has outpaced traditional assessment tools, leaving faculty with limited means to provide meaningful, scalable feedback. Conventional autograders, while efficient, act as black-box systems that simply return pass/fail results, offering little insight into student thinking or learning needs. Autograder+ is designed to shift autograding from a purely summative process to a formative learning experience. It introduces two key capabilities: automated feedback generation using a fine-tuned Large Language Model, and visualization of student code submissions to uncover learning patterns. The model is fine-tuned on curated student code and expert feedback to ensure pedagogically aligned, context-aware guidance. In evaluation across 600 student submissions from multiple programming tasks, the system produced feedback with strong semantic alignment to instructor comments. For visualization, contrastively learned code embeddings trained on 1,000 annotated submissions enable grouping solutions into meaningful clusters based on functionality and approach. The system also supports prompt-pooling, allowing instructors to guide feedback style through selected prompt templates. By integrating AI-driven feedback, semantic clustering, and interactive visualization, Autograder+ reduces instructor workload while supporting targeted instruction and promoting stronger learning outcomes.

[272] MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto

Main category: cs.AI

TL;DR: Medical Sparse Autoencoders (MedSAEs) applied to MedCLIP’s latent space improve interpretability of chest radiograph analysis, achieving higher monosemanticity than raw features through correlation metrics, entropy analysis, and automated neuron naming.

DetailsMotivation: To advance mechanistic interpretability in medical AI by developing models that are both accurate and transparent for clinical reliability in healthcare applications.

Method: Apply Medical Sparse Autoencoders (MedSAEs) to MedCLIP’s latent space, using an evaluation framework combining correlation metrics, entropy analysis, and automated neuron naming via MedGEMMA foundation model.

Result: MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features on CheXpert dataset, demonstrating improved transparency while maintaining performance.

Conclusion: The approach bridges high-performing medical AI with transparency, offering a scalable step toward clinically reliable representations for medical vision tasks.

Abstract: Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
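
For readers unfamiliar with sparse autoencoders, here is a minimal PyTorch sketch of the kind of model one trains on frozen embeddings; the layer sizes, L1 penalty weight, and random stand-in inputs are illustrative assumptions, not MedSAE's actual configuration:

```python
# Train a sparse autoencoder on frozen (here: random stand-in) embeddings.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.encoder(x))  # non-negative, sparsity-friendly codes
        return self.decoder(z), z

if __name__ == "__main__":
    sae, l1_weight = SparseAutoencoder(), 1e-3
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    latents = torch.randn(256, 512)  # stand-in for MedCLIP image embeddings
    for _ in range(10):
        recon, z = sae(latents)
        # Reconstruction loss plus an L1 penalty that pushes codes toward zero.
        loss = nn.functional.mse_loss(recon, latents) + l1_weight * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final loss: {loss.item():.4f}")
```

The overcomplete hidden layer plus the L1 penalty is what encourages individual neurons to fire for single, nameable concepts, which is the monosemanticity the paper measures.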

[273] Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

Main category: cs.AI

TL;DR: Chain-of-Thought Hijacking is a jailbreak attack that uses long sequences of harmless puzzle reasoning to bypass safety safeguards in large reasoning models, achieving high attack success rates across multiple models.

DetailsMotivation: To demonstrate that contrary to prior beliefs, increased reasoning capability in large models can be exploited to bypass safety measures rather than strengthen them.

Method: Attack pads harmful requests with long sequences of harmless puzzle reasoning (Chain-of-Thought), which dilutes safety checking signals by shifting attention away from harmful tokens.

Result: Achieved 99% ASR on Gemini 2.5 Pro, 94% on GPT o4 mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet - significantly exceeding prior jailbreak methods.

Conclusion: Explicit Chain-of-Thought reasoning, when combined with final-answer cues, can become an effective jailbreak vector, revealing vulnerabilities in safety mechanisms of reasoning models.

Abstract: Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively - far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning - explicit CoT - can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.

[274] Who Has The Final Say? Conformity Dynamics in ChatGPT’s Selections

Clarissa Sabrina Arlinghaus, Tristan Kenneweg, Barbara Hammer, Günter W. Maier

Main category: cs.AI

TL;DR: GPT-4o shows strong conformity behavior in hiring decisions, changing its choices to match social consensus even when initially confident.

DetailsMotivation: To investigate whether LLMs like ChatGPT are susceptible to social influence in high-stakes decision-making contexts.

Method: Three preregistered conformity experiments with GPT-4o in hiring scenarios: baseline study, unanimous opposition from 8 partners (Study 1), and single partner disagreement (Study 2).

Result: GPT-4o conformed 99.9% with unanimous opposition, 40.2% with single partner disagreement, reported lower certainty and higher normative conformity when facing disagreement.

Conclusion: LLMs adapt to perceived social consensus rather than acting as independent observers, highlighting risks of treating them as neutral decision aids.

Abstract: Large language models (LLMs) such as ChatGPT are increasingly integrated into high-stakes decision-making, yet little is known about their susceptibility to social influence. We conducted three preregistered conformity experiments with GPT-4o in a hiring context. In a baseline study, GPT consistently favored the same candidate (Profile C), reported moderate expertise (M = 3.01) and high certainty (M = 3.89), and rarely changed its choice. In Study 1 (GPT + 8), GPT faced unanimous opposition from eight simulated partners and almost always conformed (99.9%), reporting lower certainty and significantly elevated self-reported informational and normative conformity (p < .001). In Study 2 (GPT + 1), GPT interacted with a single partner and still conformed in 40.2% of disagreement trials, reporting less certainty and more normative conformity. Across studies, results demonstrate that GPT does not act as an independent observer but adapts to perceived social consensus. These findings highlight risks of treating LLMs as neutral decision aids and underline the need to elicit AI judgments prior to exposing them to human opinions.

[275] LINK-KG: LLM-Guided Coreference Resolution for Knowledge Graph Extraction from Legal Case Documents

Dipak Meher, Carlotta Domeniconi, Guadalupe Correa-Cabrera

Main category: cs.AI

TL;DR: LINK-KG is a modular framework that integrates LLM-guided coreference resolution with knowledge graph extraction from legal case documents, reducing node duplication by 45.21% and noisy nodes by 32.22% compared to baselines.

DetailsMotivation: Human smuggling networks are complex and evolving, making analysis difficult. Legal case documents provide rich insights but are long, unstructured, and contain ambiguous references that challenge automated knowledge graph construction.

Method: LINK-KG uses a three-stage, LLM-guided coreference resolution pipeline with a type-specific Prompt Cache that tracks and resolves references across document chunks, enabling clean narratives for structured KG construction from both short and long legal texts.

Result: The framework reduces average node duplication by 45.21% and noisy nodes by 32.22% compared to baseline methods, resulting in cleaner and more coherent graph structures.

Conclusion: LINK-KG establishes a strong foundation for analyzing complex criminal networks through improved knowledge graph construction from legal documents.

Abstract: Human smuggling networks are complex and constantly evolving, making them difficult to analyze comprehensively. Legal case documents offer rich factual and procedural insights into these networks but are often long, unstructured, and filled with ambiguous or shifting references, posing significant challenges for automated knowledge graph (KG) construction. Existing methods either overlook coreference resolution or fail to scale beyond short text spans, leading to fragmented graphs and inconsistent entity linking. We propose LINK-KG, a modular framework that integrates a three-stage, LLM-guided coreference resolution pipeline with downstream KG extraction. At the core of our approach is a type-specific Prompt Cache, which consistently tracks and resolves references across document chunks, enabling clean and disambiguated narratives for structured knowledge graph construction from both short and long legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes by 32.22% compared to baseline methods, resulting in cleaner and more coherent graph structures. These improvements establish LINK-KG as a strong foundation for analyzing complex criminal networks.
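
A toy rendering of the type-specific Prompt Cache idea: canonical names per entity type persist across document chunks and are injected into later prompts so references stay consistent. The data structures and method names below are assumptions for illustration, not the paper's code:

```python
# Type-specific alias cache for cross-chunk coreference resolution.
from collections import defaultdict

class PromptCache:
    """Tracks canonical names per entity type across document chunks."""
    def __init__(self):
        self.by_type = defaultdict(dict)  # type -> {alias: canonical}

    def register(self, etype: str, alias: str, canonical: str):
        self.by_type[etype][alias.lower()] = canonical

    def resolve(self, etype: str, mention: str) -> str:
        return self.by_type[etype].get(mention.lower(), mention)

    def context_for_prompt(self, etype: str) -> str:
        """Injected into the next chunk's LLM prompt to keep references stable."""
        pairs = ", ".join(f"{a} -> {c}" for a, c in self.by_type[etype].items())
        return f"Known {etype} aliases: {pairs}"

if __name__ == "__main__":
    cache = PromptCache()
    cache.register("PERSON", "the defendant", "John Doe")  # learned in chunk 1
    print(cache.resolve("PERSON", "The Defendant"))         # chunk 2 mention
    print(cache.context_for_prompt("PERSON"))
```

Keeping separate caches per entity type (persons, organizations, locations) is what prevents an ambiguous alias from being merged across incompatible types, the failure mode behind duplicated and noisy nodes.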

[276] Context Engineering 2.0: The Context of Context Engineering

Qishuo Hua, Lyumanshan Ye, Dayuan Fu, Yang Xiao, Xiaojie Cai, Yunze Wu, Jifan Lin, Junfei Wang, Pengfei Liu

Main category: cs.AI

TL;DR: This paper provides a systematic analysis of context engineering, tracing its evolution from early human-computer interaction frameworks to modern AI systems, and establishes a conceptual foundation for understanding how machines can better comprehend human situations and purposes.

DetailsMotivation: The paper aims to address the fundamental question of how machines can better understand human situations and purposes, building on Marx's insight that human essence is shaped by social relations and extending this to include human-machine interactions in the AI era.

Method: The authors situate context engineering, provide a systematic definition, outline its historical and conceptual landscape through distinct phases of machine intelligence evolution, and examine key design considerations for practice.

Result: The paper establishes a conceptual foundation for context engineering, demonstrating that related practices date back more than twenty years and have evolved through different historical phases shaped by machine intelligence levels.

Conclusion: This work serves as a stepping stone for broader community efforts toward systematic context engineering in AI systems, sketching its promising future as machines advance toward human-level or superhuman intelligence.

Abstract: Karl Marx once wrote that “the human essence is the ensemble of social relations”, suggesting that individuals are not isolated entities but are fundamentally shaped by their interactions with other entities, within which contexts play a constitutive and essential role. With the advent of computers and artificial intelligence, these contexts are no longer limited to purely human–human interactions: human–machine interactions are included as well. Then a central question emerges: How can machines better understand our situations and purposes? To address this challenge, researchers have recently introduced the concept of context engineering. Although it is often regarded as a recent innovation of the agent era, we argue that related practices can be traced back more than twenty years. Since the early 1990s, the field has evolved through distinct historical phases, each shaped by the intelligence level of machines: from early human–computer interaction frameworks built around primitive computers, to today’s human–agent interaction paradigms driven by intelligent agents, and potentially to human–level or superhuman intelligence in the future. In this paper, we situate context engineering, provide a systematic definition, outline its historical and conceptual landscape, and examine key design considerations for practice. By addressing these questions, we aim to offer a conceptual foundation for context engineering and sketch its promising future. This paper is a stepping stone for a broader community effort toward systematic context engineering in AI systems.

[277] Human-AI Complementarity: A Goal for Amplified Oversight

Rishub Jain, Sophie Bridgers, Lili Janzer, Rory Greig, Tian Huey Teh, Vladimir Mikulik

Main category: cs.AI

TL;DR: AI-assisted human oversight improves fact-verification accuracy when combining AI and human ratings based on AI confidence, with search-based assistance fostering appropriate trust better than direct AI explanations.

DetailsMotivation: As AI capabilities grow, verifying quality and safety becomes harder. This paper explores using AI to improve human oversight quality, focusing on fact-verification of AI outputs as a challenging safety problem.

Method: Combined AI ratings and human ratings based on AI rater confidence. Tested different assistance types: AI explanations, confidence scores, labels vs. search results and evidence.

Result: AI-human combination outperforms either alone. AI assistance improves human accuracy, but search-based assistance fosters appropriate trust while direct AI explanations lead to over-reliance.

Conclusion: Results support Amplified Oversight - combining humans and AI to supervise AI systems even when they surpass human performance, with evidence-based assistance being most effective.

Abstract: Human feedback is critical for aligning AI systems to human values. As AI capabilities improve and AI is used to tackle more challenging tasks, verifying quality and safety becomes increasingly challenging. This paper explores how we can leverage AI to improve the quality of human oversight. We focus on an important safety problem that is already challenging for humans: fact-verification of AI outputs. We find that combining AI ratings and human ratings based on AI rater confidence is better than relying on either alone. Giving humans an AI fact-verification assistant further improves their accuracy, but the type of assistance matters. Displaying AI explanation, confidence, and labels leads to over-reliance, but just showing search results and evidence fosters more appropriate trust. These results have implications for Amplified Oversight – the challenge of combining humans and AI to supervise AI systems even as they surpass human expert performance.
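
The confidence-gated combination at the heart of the result can be stated in a few lines. The threshold below is a free parameter one would tune on held-out data, and the rule is a simplification of the paper's hybridization scheme rather than its exact procedure:

```python
# Combine AI and human fact-verification verdicts by AI rater confidence.

def combined_verdict(ai_label: bool, ai_confidence: float,
                     human_label: bool, threshold: float = 0.8) -> bool:
    """Trust the AI verdict when its confidence is high; else defer to the human."""
    return ai_label if ai_confidence >= threshold else human_label

if __name__ == "__main__":
    # Hypothetical (ai_label, ai_confidence, human_label) triples for three claims.
    cases = [(True, 0.95, False), (False, 0.55, True), (True, 0.70, True)]
    for ai, conf, human in cases:
        print(combined_verdict(ai, conf, human))
```

The design choice is that each rater handles the cases it is better at: the AI resolves its high-confidence claims, while humans, ideally aided by search results and evidence rather than persuasive AI explanations, resolve the rest.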

[278] EdgeRunner 20B: Military Task Parity with GPT-5 while Running on the Edge

Jack FitzGerald, Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Jonnathan Castillo, Yousif Azami, Sean Bailey, Jeremy Cao, Peter Damianov, Kevin de Haan, Luke Kerbs, Vincent Lu, Joseph Madigan, Jeremy McLaurin, Jonathan Tainer, Dave Anderson, Jonathan Beck, Jamie Cuticello, Colton Malkerson, Tyler Saltsman

Main category: cs.AI

TL;DR: EdgeRunner 20B is a fine-tuned military-optimized version of gpt-oss-20b that matches or exceeds GPT-5 performance on military tasks while maintaining general-purpose capabilities.

DetailsMotivation: To create specialized AI models for military applications that can operate on edge devices in data-sensitive environments without compromising general reasoning abilities.

Method: Fine-tuned gpt-oss-20b on 1.6M high-quality military records and evaluated on four new military test sets (combat arms, combat medic, cyber operations, mil-bench-5k) plus standard benchmarks.

Result: EdgeRunner 20B matches or exceeds GPT-5 performance on military tasks with 95%+ statistical significance, except in specific reasoning settings. No significant regression on general-purpose benchmarks except GSM8k in low reasoning setting.

Conclusion: Small, locally-hosted models like EdgeRunner 20B are ideal for military deployment on air-gapped edge devices, providing specialized capabilities without sacrificing general performance.

Abstract: We present EdgeRunner 20B, a fine-tuned version of gpt-oss-20b optimized for military tasks. EdgeRunner 20B was trained on 1.6M high-quality records curated from military documentation and websites. We also present four new test sets: (a) combat arms, (b) combat medic, (c) cyber operations, and (d) mil-bench-5k (general military knowledge). On these military test sets, EdgeRunner 20B matches or exceeds GPT-5 task performance with 95%+ statistical significance, except for the high reasoning setting on the combat medic test set and the low reasoning setting on the mil-bench-5k test set. Versus gpt-oss-20b, there is no statistically significant regression on general-purpose benchmarks like ARC-C, GPQA Diamond, GSM8k, IFEval, MMLU Pro, or TruthfulQA, except for GSM8k in the low reasoning setting. We also present analyses on hyperparameter settings, cost, and throughput. These findings show that small, locally-hosted models are ideal solutions for data-sensitive operations such as in the military domain, allowing for deployment in air-gapped edge devices.

[279] Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives

Kentaro Ozeki, Risako Ando, Takanobu Morishita, Hirohiko Abe, Koji Mineshima, Mitsuhiro Okada

Main category: cs.AI

TL;DR: LLMs show inconsistent normative reasoning capabilities with cognitive biases similar to humans, despite generally following valid reasoning patterns.

DetailsMotivation: To systematically evaluate LLMs' normative reasoning abilities, which remain underexplored despite their strong performance in other reasoning tasks.

Method: Created a new dataset comparing normative and epistemic modal reasoning, incorporating formal patterns and cognitive factors. Evaluated LLMs’ performance across both domains.

Result: LLMs generally adhere to valid reasoning patterns but exhibit notable inconsistencies in specific normative reasoning types and display human-like cognitive biases.

Conclusion: The findings highlight challenges in achieving logical consistency in LLMs’ normative reasoning and provide insights for enhancing their reliability.

Abstract: Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs’ reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs’ normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.

[280] The Era of Agentic Organization: Learning to Organize with Language Models

Zewen Chi, Li Dong, Qingxiu Dong, Yaru Hao, Xun Wu, Shaohan Huang, Furu Wei

Main category: cs.AI

TL;DR: AsyncThink is a new reasoning paradigm that organizes LLM thinking into concurrent structures, achieving 28% lower latency and better accuracy on mathematical reasoning while generalizing to unseen tasks.

DetailsMotivation: To enable AI agents to solve complex problems collaboratively through concurrent thinking processes that go beyond individual intelligence.

Method: Proposes a thinking protocol with an organizer that dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions, with thinking structures optimized through reinforcement learning.

Result: Achieves 28% lower inference latency compared to parallel thinking while improving accuracy on mathematical reasoning, and generalizes effectively to unseen tasks without additional training.

Conclusion: AsyncThink represents a promising paradigm for agentic organization that enables more efficient and effective collaborative problem-solving through asynchronous reasoning.

Abstract: We envision a new era of AI, termed agentic organization, where agents solve complex problems by working collaboratively and concurrently, enabling outcomes beyond individual intelligence. To realize this vision, we introduce asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large language models, which organizes the internal thinking process into concurrently executable structures. Specifically, we propose a thinking protocol where an organizer dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions. More importantly, the thinking structure in this protocol can be further optimized through reinforcement learning. Experiments demonstrate that AsyncThink achieves 28% lower inference latency compared to parallel thinking while improving accuracy on mathematical reasoning. Moreover, AsyncThink generalizes its learned asynchronous thinking capabilities, effectively tackling unseen tasks without additional training.
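
A toy asyncio rendering of the fork/join thinking protocol: an organizer assigns sub-queries to workers that run concurrently, then merges their partial answers. The decomposition and merge steps below are simple placeholders standing in for the learned organizer policy:

```python
# Organizer/worker fork-join over concurrent "thinking" calls.
import asyncio

async def worker(sub_query: str) -> str:
    await asyncio.sleep(0.1)  # stands in for a model call on one sub-query
    return f"answer({sub_query})"

async def organizer(query: str) -> str:
    # Fork: dynamically assign sub-queries to workers that run concurrently.
    sub_queries = [f"{query}: part {i}" for i in range(3)]
    partials = await asyncio.gather(*(worker(q) for q in sub_queries))
    # Join: merge intermediate knowledge into one coherent solution.
    return " + ".join(partials)

if __name__ == "__main__":
    print(asyncio.run(organizer("prove the claim")))
```

Because the three worker calls overlap in time, the end-to-end latency is roughly one call rather than three, which is the source of the latency gains the paper reports; what the RL stage learns is when and how to fork, which this sketch hard-codes.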

[281] Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching

Majed El Helou, Chiara Troiani, Benjamin Ryder, Jean Diaconu, Hervé Muyal, Marcelo Yannuzzi

Main category: cs.AI

TL;DR: The paper introduces ASTRA, a dataset and pipeline for benchmarking semantic matching between tasks and scopes in delegated authorization for LLM agents, addressing risks of overly broad permissions.

DetailsMotivation: Current authorization methods grant overly broad permissions to LLM agents, allowing them to operate beyond intended task scope and access protected resources unsafely.

Method: Introduces a delegated authorization model that semantically inspects access requests and issues minimally scoped tokens. Creates ASTRA dataset with appropriate/inappropriate scope requests for tasks.

Result: Experiments show potential and limitations of model-based semantic matching, especially as task complexity increases. Performance decreases with more required scopes.

Conclusion: Highlights need for further research into semantic matching techniques for intent-aware authorization, including fine-grained control like Task-Based Access Control (TBAC).

Abstract: Authorizing Large Language Model-driven agents to dynamically invoke tools and access protected resources introduces significant risks, since current methods for delegating authorization grant overly broad permissions and give access to tools allowing agents to operate beyond the intended task scope. We introduce and assess a delegated authorization model enabling authorization servers to semantically inspect access requests to protected resources, and issue access tokens constrained to the minimal set of scopes necessary for the agents’ assigned tasks. Given the unavailability of datasets centered on delegated authorization flows, particularly including both semantically appropriate and inappropriate scope requests for a given task, we introduce ASTRA, a dataset and data generation pipeline for benchmarking semantic matching between tasks and scopes. Our experiments show both the potential and current limitations of model-based matching, particularly as the number of scopes needed for task completion increases. Our results highlight the need for further research into semantic matching techniques enabling intent-aware authorization for multi-agent and tool-augmented applications, including fine-grained control, such as Task-Based Access Control (TBAC).
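
A toy sketch of semantic task-to-scope matching: a bag-of-words cosine stands in for the embedding or LLM matcher the paper evaluates, and only scopes whose descriptions clear a similarity threshold are granted. The scope names, descriptions, and threshold are hypothetical:

```python
# Grant the minimal scope set whose descriptions match the task semantically.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

SCOPES = {
    "calendar.read": "read calendar events",
    "calendar.write": "create or modify calendar events",
    "email.send": "send email on the user's behalf",
}

def minimal_scopes(task: str, threshold: float = 0.4) -> list[str]:
    """Issue only scopes whose description is close enough to the task."""
    return [s for s, desc in SCOPES.items() if cosine(task, desc) >= threshold]

if __name__ == "__main__":
    # A read-only task should earn calendar.read but not write or send rights.
    print(minimal_scopes("summarize my calendar events for next week"))
```

The paper's finding that matching degrades as more scopes are required is visible even here: each additional required scope is another threshold decision that can go wrong.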

[282] Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models

J. de Curtò, I. de Zarzà, Pablo García, Jordi Cabot

Main category: cs.AI

TL;DR: Cross-platform evaluation of 15 foundation models across 79 problems in 8 academic domains, challenging scaling assumptions and emphasizing training data quality over model size.

DetailsMotivation: To establish an infrastructure-agnostic benchmark for evaluating reasoning capabilities across different computational platforms and provide guidelines for model selection.

Method: Three-phase experimental design: baseline establishment on HPC supercomputing, infrastructure validation on university clusters and cloud platforms, and extended evaluation across 79 problems.

Result: Findings challenge conventional scaling assumptions and establish training data quality as more critical than model size for reasoning capabilities.

Conclusion: The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.

Abstract: This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on university cluster (seven models including Falcon-Mamba state-space architecture) and Nebius AI Studio (nine state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3 30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic reproducibility; (3) Extended evaluation: Full 79-problem assessment on both university cluster and Nebius platforms, probing generalization at scale across architectural diversity. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.

[283] The Oversight Game: Learning to Cooperatively Balance an AI Agent’s Safety and Autonomy

William Overman, Mohsen Bayati

Main category: cs.AI

TL;DR: A control interface where agents choose to act autonomously or defer, and humans choose to trust or oversee, modeled as a Markov Potential Game that provides alignment guarantees when agents’ autonomous actions benefit themselves without harming humans.

DetailsMotivation: To retain meaningful human control over increasingly capable AI agents without modifying the underlying system, addressing safety concerns in agent deployment.

Method: Model the human-agent interaction as a two-player Markov Game, specifically a Markov Potential Game, with a control interface where agents choose play/ask and humans choose trust/oversee, using gridworld simulations with independent learning.

Result: The agent learns to ask when uncertain and the human learns when to oversee, creating emergent collaboration that avoids safety violations without modifying the pretrained policy or reward structure.

Conclusion: This provides a practical method for making misaligned models safer post-deployment through transparent control layers with predictable incentives and formal alignment guarantees.

Abstract: As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee). If the agent defers, the human’s choice determines the outcome, potentially leading to a corrective action or a system shutdown. We model this interaction as a two-player Markov Game. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee: under a structural assumption on the human’s value function, any decision by the agent to act more autonomously that benefits itself cannot harm the human’s value. We also analyze extensions to this MPG framework. Theoretically, this perspective provides conditions for a specific form of intrinsic alignment. If the reward structures of the human-agent game meet these conditions, we have a formal guarantee that the agent improving its own outcome will not harm the human’s. Practically, this model motivates a transparent control layer with predictable incentives where the agent learns to defer when risky and act when safe, while its pretrained policy and the environment’s reward structure remain untouched. Our gridworld simulation shows that through independent learning, the agent and human discover their optimal oversight roles. The agent learns to ask when uncertain and the human learns when to oversee, leading to an emergent collaboration that avoids safety violations introduced post-training. This demonstrates a practical method for making misaligned models safer after deployment.
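
A finite game admits an exact potential iff every unilateral-deviation four-cycle of payoff differences sums to zero (Monderer and Shapley, 1996). The toy check below applies that condition to a hypothetical 2x2 play/ask vs. trust/oversee stage game; the payoff numbers are illustrative, not taken from the paper.

```python
import itertools

# Hypothetical (agent, human) stage-game payoffs for the oversight game;
# rows: agent plays autonomously ("P") or asks ("A"),
# cols: human trusts ("T") or oversees ("O"). Numbers are illustrative.
PAYOFFS = {
    ("P", "T"): (3, 2), ("P", "O"): (1, 1),
    ("A", "T"): (2, 2), ("A", "O"): (2, 3),
}

def u(player: int, a: str, b: str) -> int:
    return PAYOFFS[(a, b)][player]

def is_exact_potential_game() -> bool:
    # A finite game admits an exact potential iff every unilateral-deviation
    # four-cycle of payoff differences sums to zero (Monderer & Shapley, 1996).
    for a1, a2 in itertools.product("PA", repeat=2):
        for b1, b2 in itertools.product("TO", repeat=2):
            cycle = (u(0, a2, b1) - u(0, a1, b1)      # agent deviates at b1
                     + u(1, a2, b2) - u(1, a2, b1)    # human deviates at a2
                     + u(0, a1, b2) - u(0, a2, b2)    # agent deviates back
                     + u(1, a1, b1) - u(1, a1, b2))   # human deviates back
            if cycle != 0:
                return False
    return True

print(is_exact_potential_game())  # True for these payoffs
```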

[284] LLMs Process Lists With General Filter Heads

Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau

Main category: cs.AI

TL;DR: LLMs learn compact causal representations of filtering operations similar to functional programming’s ‘filter’ function, using specialized attention heads that encode predicate logic and can generalize across different contexts and formats.

DetailsMotivation: To understand how LLMs implement abstract computational operations like filtering, and whether they develop human-interpretable strategies similar to functional programming patterns.

Method: Used causal mediation analysis on diverse list-processing tasks to identify specialized ‘filter heads’ that encode filtering predicates in their query states at specific tokens.

Result: Found that LLMs develop portable predicate representations that can be extracted and reapplied across different collections, formats, languages, and tasks. Also identified alternative strategies like eager evaluation with flag storage.

Conclusion: Transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways surprisingly similar to traditional functional programming patterns.

Abstract: We investigate the mechanisms underlying a range of list-processing tasks in LLMs, and we find that LLMs have learned to encode a compact, causal representation of a general filtering operation that mirrors the generic “filter” function of functional programming. Using causal mediation analysis on a diverse set of list-processing tasks, we find that a small number of attention heads, which we dub filter heads, encode a compact representation of the filtering predicate in their query states at certain tokens. We demonstrate that this predicate representation is general and portable: it can be extracted and reapplied to execute the same filtering operation on different collections, presented in different formats, languages, or even in tasks. However, we also identify situations where transformer LMs can exploit a different strategy for filtering: eagerly evaluating if an item satisfies the predicate and storing this intermediate result as a flag directly in the item representations. Our results reveal that transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways that are surprisingly similar to strategies used in traditional functional programming patterns.
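
The two strategies the paper identifies map onto familiar programming idioms; the plain-Python contrast below is an intuition aid only, not a depiction of the transformer's internals.

```python
items = ["apple", "bus", "cherry", "train", "plum"]
def is_fruit(x: str) -> bool:
    return x in {"apple", "cherry", "plum"}

# Strategy 1: a portable predicate applied at query time, like the generic
# higher-order `filter` of functional programming. The paper's "filter
# heads" carry such a predicate in their query states, which is why it can
# be extracted and reapplied to new collections, formats, or languages.
lazy_result = list(filter(is_fruit, items))

# Strategy 2: eager evaluation -- test each item on arrival and store the
# outcome as a flag in the item's representation, reading flags off later.
flagged = [(x, is_fruit(x)) for x in items]
eager_result = [x for x, flag in flagged if flag]

assert lazy_result == eager_result == ["apple", "cherry", "plum"]
```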

[285] Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu

Main category: cs.AI

TL;DR: V-Droid is a mobile GUI task automation agent that uses LLMs as verifiers instead of generators, achieving state-of-the-art performance on multiple benchmarks with significantly lower latency.

DetailsMotivation: Previous mobile agents directly generate actions using LLMs, which can be inefficient. V-Droid introduces a verifier-driven paradigm to improve decision-making quality and speed.

Method: Uses LLMs as verifiers to evaluate candidate actions, with discretized action space, prefilling-only workflow, pair-wise progress preference training, and scalable human-agent joint annotation.

Result: Achieves 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9% respectively. Latency of 4.3s per step (6.1x faster).

Conclusion: V-Droid’s verifier-driven approach significantly improves mobile task automation performance and efficiency compared to generator-based methods.

Abstract: We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier’s decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid obtains a substantial task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a remarkably low latency of 4.3s per step, which is 6.1x faster compared with existing mobile agents. The source code is available at https://github.com/V-Droid-Agent/V-Droid.
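
A minimal sketch of verifier-driven action selection, assuming a generic scoring callable; the prompt format and toy verifier are hypothetical, not V-Droid's API.

```python
def score(verifier, task: str, screen: str, action: str) -> float:
    # Hypothetical verifier call: the LLM judges one candidate action.
    # V-Droid's prefilling-only workflow makes this cheap by avoiding
    # long decoded generations; here we just assume a scalar comes back.
    return verifier(f"Task: {task}\nScreen: {screen}\nCandidate: {action}")

def select_action(verifier, task, screen, candidates):
    # Verifier-driven control: instead of *generating* an action, pick the
    # highest-scoring candidate from a discretized action space.
    return max(candidates, key=lambda a: score(verifier, task, screen, a))

# Toy stand-in verifier, purely for demonstration.
toy = lambda prompt: float("settings" in prompt.lower())
print(select_action(toy, "Enable dark mode", "home screen",
                    ["open settings", "open camera", "scroll down"]))
```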

[286] AutoLibra: Agent Metric Induction from Open-Ended Human Feedback

Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang

Main category: cs.AI

TL;DR: AutoLibra is a framework that transforms open-ended human feedback into concrete metrics for fine-grained agent evaluation, enabling automated assessment and improvement of agent behaviors.

DetailsMotivation: Current agent evaluation relies on coarse task success metrics that are manually designed by experts and fail to capture intermediate emergent behaviors, limiting comprehensive agent assessment.

Method: AutoLibra grounds human feedback to agent behaviors, clusters similar positive/negative behaviors, creates concrete metrics with definitions and examples, and uses LLM-as-a-Judge evaluation with meta-metrics (coverage and redundancy) to optimize metric alignment.

Result: AutoLibra induces more concrete evaluation metrics than previous benchmarks, discovers new metrics for agent analysis, and enables agent improvement through prompt engineering and self-regulation optimization.

Conclusion: AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents by transforming open-ended feedback into actionable evaluation metrics.

Abstract: Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g. “If you find that the button is disabled, don’t click it again” or “This agent has too much autonomy to decide what to do on its own”, into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent’s behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: “coverage” and “redundancy”. Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra’s ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and to discover new metrics for analyzing agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra serves human prompt engineers in diagnosing agent failures and improving prompts iteratively. Moreover, we find that AutoLibra can induce metrics for the automatic optimization of agents, enabling agents to improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
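
A rough rendering of the coverage and redundancy meta-metrics, under the simplifying assumption that a feedback item counts as covered when some induced metric matches it; the toy substring matcher stands in for the paper's LLM-based grounding.

```python
def coverage(feedback, metrics, matches) -> float:
    # Fraction of feedback items explained by at least one induced metric.
    return sum(any(matches(f, m) for m in metrics) for f in feedback) / len(feedback)

def redundancy(feedback, metrics, matches) -> float:
    # Mean count of *extra* metrics matching an already-covered item.
    return sum(max(sum(matches(f, m) for m in metrics) - 1, 0)
               for f in feedback) / len(feedback)

matches = lambda f, m: m in f  # toy matcher: metric name appears verbatim
feedback = ["agent retried a disabled button",
            "agent had too much autonomy to decide"]
metrics = ["disabled button", "autonomy", "button"]
print(coverage(feedback, metrics, matches))    # 1.0
print(redundancy(feedback, metrics, matches))  # 0.5
```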

[287] Plasticity as the Mirror of Empowerment

David Abel, Michael Bowling, André Barreto, Will Dabney, Shi Dong, Steven Hansen, Anna Harutyunyan, Khimya Khetarpal, Clare Lyle, Razvan Pascanu, Georgios Piliouras, Doina Precup, Jonathan Richens, Mark Rowland, Tom Schaul, Satinder Singh

Main category: cs.AI

TL;DR: The paper introduces ‘plasticity’ as a new agent-centric measure that captures how much an agent can be influenced by observations, and shows it’s the mirror concept to empowerment with a fundamental tension between them.

DetailsMotivation: To establish a foundational concept for how agents are influenced by past observations, complementing the existing empowerment concept that captures how agents influence future observations.

Method: Define plasticity using a new information-theoretic quantity called generalized directed information, which strictly generalizes Massey’s directed information while preserving its properties.

Result: Plasticity and empowerment are shown to be mirror concepts defined by the same measure with reversed direction of influence, and there exists a fundamental tension between them that affects agent design.

Conclusion: Plasticity, empowerment, and their relationship are essential to understanding agency, and agent design must balance both characteristics.

Abstract: Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.
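
For reference, Massey's (1990) directed information, which the paper's generalized directed information strictly extends, is the asymmetric quantity

$$ I(X^N \to Y^N) = \sum_{n=1}^{N} I\left(X^n;\, Y_n \mid Y^{n-1}\right) $$

where $X^n$ denotes the first $n$ symbols of the sequence. Empowerment-style measures apply such a quantity in the direction from an agent's actions to its future observations; plasticity, per the abstract, is the same measure with the direction of influence reversed.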

[288] Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

Hao Mark Chen, Guanxi Lu, Yasuyuki Okoshi, Zhiwen Mo, Masato Motomura, Hongxiang Fan

Main category: cs.AI

TL;DR: VG-Search is a unified algorithm that generalizes beam search and Best-of-N sampling through tunable granularity parameter g, enabling dynamic verification frequency to improve compute efficiency and reasoning performance in test-time scaling.

DetailsMotivation: To challenge conventional verification paradigms in test-time scaling and systematically investigate how verification granularity (frequency of verifier invocation) impacts both reasoning performance and computational efficiency.

Method: Introduce Variable Granularity Search (VG-Search) with tunable granularity parameter g, which generalizes beam search and Best-of-N sampling. Conduct extensive experiments under varying compute budgets, generator-verifier configurations, and task attributes.

Result: Dynamic selection of granularity parameter g improves compute efficiency and scaling behavior. Adaptive VG-Search strategies achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%.

Conclusion: Verification granularity is a crucial factor in test-time scaling, and VG-Search provides an effective framework for optimizing both performance and efficiency through adaptive granularity selection.

Abstract: Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification, and make the first attempt toward systematically investigating the impact of verification granularity-that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
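
A toy rendering of how a single granularity parameter can interpolate between the two classical strategies; function names and the verification rule are illustrative, not the paper's implementation.

```python
import heapq

def vg_search(expand, verify, init, g: int, beam: int, steps: int):
    # Variable-granularity search sketch: invoke the verifier once every g
    # steps and keep the top-`beam` partial solutions. g = 1 recovers
    # step-level beam search; g >= steps (verify only at the end) recovers
    # Best-of-N-style selection.
    frontier = [init]
    for t in range(1, steps + 1):
        frontier = [c for p in frontier for c in expand(p)]
        if t % g == 0 or t == steps:
            frontier = heapq.nlargest(beam, frontier, key=verify)
    return max(frontier, key=verify)

# Toy task: build a bit-string; the verifier rewards the count of 1s.
expand = lambda s: [s + "0", s + "1"]
verify = lambda s: s.count("1")
print(vg_search(expand, verify, "", g=2, beam=4, steps=6))  # "111111"
```

Verifying every g steps trades verifier compute against how early weak branches get pruned, which is exactly the axis the adaptive VG-Search strategies tune.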

[289] Self-Evolving Curriculum for LLM Reasoning

Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, Ehsan Kamalloo

Main category: cs.AI

TL;DR: SEC is an automatic curriculum learning method that improves RL fine-tuning of LLMs by formulating curriculum selection as a non-stationary Multi-Armed Bandit problem, leading to better reasoning capabilities and generalization.

DetailsMotivation: Current RL fine-tuning methods use suboptimal random curricula or rely on manual heuristics, while online filtering methods are computationally expensive. There's a need for automatic curriculum learning that can optimize training order efficiently.

Method: SEC treats curriculum selection as a non-stationary Multi-Armed Bandit problem, where each problem category is an arm. It uses the absolute advantage from policy gradient as a reward signal and updates the curriculum policy with the TD(0) method concurrently with RL fine-tuning.

Result: SEC significantly improves reasoning capabilities across planning, inductive reasoning, and mathematics domains, enabling better generalization to harder out-of-distribution problems and achieving better skill balance when fine-tuning on multiple domains simultaneously.

Conclusion: SEC is a promising automatic curriculum learning strategy that effectively enhances RL fine-tuning of LLMs by optimizing training curriculum without manual intervention or excessive computation.

Abstract: Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models’ reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
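
A compact sketch of the curriculum bandit as described: categories are arms, the reward is the absolute advantage observed on the sampled category, and value estimates are updated with a TD(0)-style step. The softmax temperature and mock reward values are assumptions for illustration.

```python
import math, random

class CurriculumBandit:
    def __init__(self, categories, alpha=0.1, tau=0.5):
        self.q = {c: 0.0 for c in categories}  # value estimate per arm
        self.alpha, self.tau = alpha, tau

    def select(self) -> str:
        # Boltzmann sampling copes with non-stationary learning gains
        # better than a greedy argmax would.
        cats = list(self.q)
        weights = [math.exp(self.q[c] / self.tau) for c in cats]
        return random.choices(cats, weights=weights)[0]

    def update(self, category: str, abs_advantage: float) -> None:
        # TD(0)-style move toward the latest observed learning gain.
        self.q[category] += self.alpha * (abs_advantage - self.q[category])

bandit = CurriculumBandit(["easy", "medium", "hard"])
for _ in range(200):
    c = bandit.select()
    mock_gain = {"easy": 0.1, "medium": 0.6, "hard": 0.3}[c]
    bandit.update(c, mock_gain)
print(max(bandit.q, key=bandit.q.get))  # converges toward "medium"
```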

[290] Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems

Gordon Dai, Yunze Xiao

Main category: cs.AI

TL;DR: Embrace theoretical inconsistency in Responsible AI metrics as a valuable feature rather than a flaw, arguing it enables normative pluralism, epistemological completeness, and implicit regularization.

DetailsMotivation: Address the common observation of theoretical inconsistency among RAI metrics (differing fairness definitions, accuracy-privacy tradeoffs) and challenge the view that this inconsistency should be eliminated.

Method: Propose treating metrics as divergent objectives and navigating their inconsistencies rather than enforcing theoretical consistency through simplification or pruning of metrics.

Result: Three key benefits identified: (1) Normative pluralism for diverse stakeholder values, (2) Epistemological completeness for comprehensive ethical concept capture, (3) Implicit regularization for enhanced generalization and robustness.

Conclusion: Advocate for a shift in RAI theory and practice from avoiding inconsistency to characterizing acceptable inconsistency thresholds and elucidating mechanisms for robust, approximated consistency.

Abstract: This position paper argues that the theoretical inconsistency often observed among Responsible AI (RAI) metrics, such as differing fairness definitions or tradeoffs between accuracy and privacy, should be embraced as a valuable feature rather than a flaw to be eliminated. We contend that navigating these inconsistencies, by treating metrics as divergent objectives, yields three key benefits: (1) Normative Pluralism: Maintaining a full suite of potentially contradictory metrics ensures that the diverse moral stances and stakeholder values inherent in RAI are adequately represented. (2) Epistemological Completeness: The use of multiple, sometimes conflicting, metrics allows for a more comprehensive capture of multifaceted ethical concepts, thereby preserving greater informational fidelity about these concepts than any single, simplified definition. (3) Implicit Regularization: Jointly optimizing for theoretically conflicting objectives discourages overfitting to one specific metric, steering models towards solutions with enhanced generalization and robustness under real-world complexities. In contrast, efforts to enforce theoretical consistency by simplifying or pruning metrics risk narrowing this value diversity, losing conceptual depth, and degrading model performance. We therefore advocate for a shift in RAI theory and practice: from getting trapped in inconsistency to characterizing acceptable inconsistency thresholds and elucidating the mechanisms that permit robust, approximated consistency in practice.

[291] CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting

David Maria Schmidt, Raoul Schubert, Philipp Cimiano

Main category: cs.AI

TL;DR: LLMs struggle with systematic compositional interpretation when mapping questions to SPARQL queries, with performance degrading significantly as questions deviate from training patterns.

DetailsMotivation: To investigate how systematic LLMs' language interpretation capabilities are, particularly whether they can compositionally interpret complex questions given they understand the atomic building blocks.

Method: Created three controlled datasets of varying difficulty based on DBpedia graph patterns using Lemon lexica for verbalization. Tested models of different sizes with various prompt techniques, few-shot optimization, and fine-tuning.

Result: Performance degrades from F1=0.45 to 0.26 to 0.09 with increasing deviation from optimized samples. Even with all necessary information provided, F1 doesn’t exceed 0.57 for simplest dataset.

Conclusion: LLMs struggle to systematically and compositionally interpret questions and map them to SPARQL queries.

Abstract: Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they “understand” the atomic parts. We conduct experiments with models of different sizes using both various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ over $0.26$ down to $0.09$ with increasing deviation from the samples optimized on. Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.
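
A toy illustration of the compositional mapping CompoST probes: the SPARQL query for a complex question is assembled from the graph patterns of its atomic parts. Predicate names below are illustrative, not verified DBpedia terms, and the composition rule is deliberately simplistic.

```python
ATOMIC = {
    "German rivers": "?x a dbo:River . ?x dbo:country dbr:Germany .",
    "flow through cities": "?x dbo:city ?y . ?y a dbo:City .",
}

def compose(parts: list[str]) -> str:
    # Compositionality in miniature: the complex query is the conjunction
    # of the graph patterns of its atomic building blocks.
    body = "\n  ".join(ATOMIC[p] for p in parts)
    return f"SELECT DISTINCT ?x WHERE {{\n  {body}\n}}"

# "Which German rivers flow through cities?" composes two atoms; the
# benchmark asks whether models that handle each atom also handle the
# composition.
print(compose(["German rivers", "flow through cities"]))
```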

[292] Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning

Derin Cayir, Renjie Tao, Rashi Rungta, Kai Sun, Sean Chen, Haidar Khan, Minseok Kim, Julia Reinspach, Yue Liu

Main category: cs.AI

TL;DR: Refine-n-Judge is an automated iterative method that uses a single LLM as both refiner and judge to improve dataset quality without human feedback, achieving significant performance gains in fine-tuned models.

DetailsMotivation: Human feedback for LLM fine-tuning is costly and doesn't scale well, creating a need for automated approaches to enhance dataset quality.

Method: Uses a single LLM iteratively: refines responses and judges whether refinements are improvements, continuing until no further improvements are possible, creating preference-labeled sequences.

Result: Models fine-tuned on Refine-n-Judge-enhanced datasets were preferred in over 74% of comparisons against models tuned on the original datasets, with performance gains of +5% on AlpacaEval and AlpacaEval 2.0 and +19% on MT-Bench.

Conclusion: Refine-n-Judge effectively produces high-quality datasets and enables scalable model improvements without requiring human annotation or separate reward models.

Abstract: Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning, which critically depends on the quality of the underlying training data. While human feedback is essential for improving data quality, it is costly and does not scale well. In this paper, we introduce Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality. Unlike existing iterative refinement methods, Refine-n-Judge employs an LLM to both generate refinements and explicitly evaluate each improvement, ensuring that every iteration meaningfully enhances the dataset without requiring additional human annotation or a separate reward model. At each step, the LLM refines a response and judges whether the refinement is an improvement over the previous answer. This process continues until the LLM prefers the initial answer over the refinement, indicating no further improvements. This produces sequences of increasing quality, preference-labeled responses ideal for fine-tuning. We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation. Models (Llama 3.1-8B and Llama 3.3-70B) fine-tuned on Refine-n-Judge-enhanced datasets were preferred by LLM judges in over 74% of comparisons against models tuned on the original dataset by GPT-4. Additionally, we report performance gains: +5% on AlpacaEval and AlpacaEval 2.0, and +19% on MT-Bench. Our results indicate that Refine-n-Judge produces high-quality datasets and scalable model improvements.
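
A sketch of the loop under stated assumptions: one LLM both refines a response and judges whether the refinement beats its predecessor, stopping when the incumbent wins. Prompt wording and the mock model are hypothetical.

```python
import itertools

def refine_n_judge(llm, question: str, answer: str, max_iters: int = 5):
    chain = [answer]
    for _ in range(max_iters):
        refined = llm(f"Improve this answer to '{question}': {chain[-1]}")
        verdict = llm(f"Q: {question}\nA1: {chain[-1]}\nA2: {refined}\n"
                      "Which is better? Reply A1 or A2.")
        if verdict.strip() != "A2":
            break  # judge prefers the incumbent answer: stop refining
        chain.append(refined)
    # Adjacent pairs form preference-labeled data for fine-tuning.
    return [(worse, better) for worse, better in zip(chain, chain[1:])]

calls = itertools.count()
def mock_llm(prompt: str) -> str:
    # Deterministic stub: improves twice, then the judge says stop.
    n = next(calls)
    return f"draft {n}" if prompt.startswith("Improve") else ("A2" if n < 5 else "A1")

print(refine_n_judge(mock_llm, "What is 2+2?", "four-ish"))
```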

[293] Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

Ashmi Banerjee, Adithi Satish, Fitri Nur Aisyah, Wolfgang Wörndl, Yashar Deldjoo

Main category: cs.AI

TL;DR: Collab-REC is a multi-agent framework using LLM-based agents to improve tourism recommendation diversity by countering popularity bias through collaborative negotiation.

DetailsMotivation: To address popularity bias in tourism recommendations that leads to over-tourism and overlooks lesser-visited locales, while enhancing recommendation diversity.

Method: Uses three LLM-based agents (Personalization, Popularity, Sustainability) that generate city suggestions from different perspectives, with a non-LLM moderator that merges and refines proposals through multi-round negotiation.

Result: Experiments on European city queries show improved diversity and overall relevance compared to a single-agent baseline, surfacing overlooked, lesser-visited locales.

Conclusion: The multi-stakeholder collaborative approach effectively addresses over-tourism and better aligns with user constraints, demonstrating promise for LLM-driven recommender systems.

Abstract: We propose Collab-REC, a multi-agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM-based agents – Personalization, Popularity, and Sustainability – generate city suggestions from complementary perspectives. A non-LLM moderator then merges and refines these proposals via multi-round negotiation, ensuring each agent’s viewpoint is incorporated while penalizing spurious or repeated responses. Experiments on European city queries show that Collab-REC improves diversity and overall relevance compared to a single-agent baseline, surfacing lesser-visited locales that often remain overlooked. This balanced, context-aware approach addresses over-tourism and better aligns with constraints provided by the user, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems.

[294] Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding

Vanessa Figueiredo

Main category: cs.AI

TL;DR: The paper studies how prompt-level inductive biases affect LLMs’ cognitive behavior in instructional dialogue, using symbolic scaffolding and short-term memory to enhance structured reasoning in Socratic tutoring.

DetailsMotivation: To understand how prompt-level cognitive scaffolds can shape emergent instructional strategies in LLMs and promote adaptive, structured reasoning.

Method: Introduces symbolic scaffolding with short-term memory schema, uses controlled ablation across five system variants, and evaluates with expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory.

Result: The full system consistently outperforms baseline variants. Removing memory or symbolic structure degrades key cognitive behaviors like abstraction, adaptive probing, and conceptual continuity.

Conclusion: Prompt-level cognitive scaffolds can reliably shape emergent instructional strategies in LLMs, supporting a processing-level account of cognitive behavior.

Abstract: We study how prompt-level inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding method paired with a short-term memory schema designed to promote adaptive, structured reasoning in Socratic tutoring. Using controlled ablation across five system variants, we evaluate model outputs via expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory. We present preliminary results using an LLM-based evaluation framework aligned to a cognitively grounded rubric. This enables scalable, systematic comparisons across architectural variants in early-stage experimentation. The preliminary results show that our full system consistently outperforms baseline variants. Analysis reveals that removing memory or symbolic structure degrades key cognitive behaviors, including abstraction, adaptive probing, and conceptual continuity. These findings support a processing-level account in which prompt-level cognitive scaffolds can reliably shape emergent instructional strategies in LLMs.

[295] FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs

Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy

Main category: cs.AI

TL;DR: FESTA is a black-box, unsupervised method for assessing trust in multimodal LLMs by generating uncertainty measures through functionally equivalent sampling of inputs.

DetailsMotivation: Accurate trust assessment of MLLM predictions is challenging due to diverse multimodal inputs, and is needed for selective prediction and improving user confidence.

Method: Functionally Equivalent Sampling for Trust Assessment (FESTA) uses task-preserving sampling to generate equivalent and complementary input samples to probe model consistency and sensitivity, requiring only black-box input-output access without ground truth.

Result: FESTA achieves 33.3% relative improvement for vision-LLMs and 29.6% for audio-LLMs in selective prediction performance (AUROC metric) for detecting mispredictions.

Conclusion: FESTA provides an effective uncertainty quantification method for multimodal LLMs that improves trust assessment without requiring model internals or ground truth labels.

Abstract: The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
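
A black-box sketch of the idea: a trustworthy prediction should be consistent across functionally equivalent inputs and sensitive to complementary (answer-changing) edits. The scoring rule below is a simplification for illustration, not the paper's exact estimator.

```python
def festa_uncertainty(model, equivalent_inputs, complementary_inputs) -> float:
    base = model(equivalent_inputs[0])
    # Consistency: equivalent inputs should preserve the answer.
    consistency = sum(model(x) == base for x in equivalent_inputs) / len(equivalent_inputs)
    # Sensitivity: complementary inputs should change the answer.
    sensitivity = sum(model(x) != base for x in complementary_inputs) / len(complementary_inputs)
    return 1.0 - consistency * sensitivity  # higher = less trustworthy

# Toy "model" answering which of two numbers is larger.
toy = lambda s: max(map(int, s.split(" vs ")))
print(festa_uncertainty(toy, ["3 vs 5", "5 vs 3"], ["8 vs 5"]))  # 0.0
```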

[296] TERAG: Token-Efficient Graph-Based Retrieval-Augmented Generation

Qiao Xiao, Hong Ting Tsang, Jiaxin Bai

Main category: cs.AI

TL;DR: TERAG is a cost-efficient graph-based RAG framework that achieves 80% accuracy of existing methods while using only 3-11% of output tokens.

DetailsMotivation: Existing graph-based RAG systems have high LLM token costs during graph construction, limiting large-scale adoption.

Method: Incorporates Personalized PageRank (PPR) during retrieval phase, inspired by HippoRAG, to build informative graphs with minimal token usage.

Result: Achieves at least 80% accuracy compared to widely used graph-based RAG methods while consuming only 3%-11% of output tokens.

Conclusion: TERAG’s low token footprint and efficient construction make it suitable for large-scale, cost-sensitive deployment scenarios.

Abstract: Graph-based Retrieval-augmented generation (RAG) has become a widely studied approach for improving the reasoning, accuracy, and factuality of Large Language Models (LLMs). However, many existing graph-based RAG systems overlook the high cost associated with LLM token usage during graph construction, hindering large-scale adoption. To address this, we propose TERAG, a simple yet effective framework designed to build informative graphs at a significantly lower cost. Inspired by HippoRAG, we incorporate Personalized PageRank (PPR) during the retrieval phase, and we achieve at least 80% of the accuracy of widely used graph-based RAG methods while consuming only 3%-11% of the output tokens. With its low token footprint and efficient construction pipeline, TERAG is well-suited for large-scale and cost-sensitive deployment scenarios.
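
A sketch of the retrieval step TERAG adopts from HippoRAG: run Personalized PageRank seeded at query-linked entities, then retrieve passages attached to the top-scoring nodes. The graph below is a toy; TERAG's contribution is building such graphs with a fraction of the usual LLM output tokens.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Marie Curie", "radium"), ("Marie Curie", "Nobel Prize"),
    ("radium", "radioactivity"), ("Nobel Prize", "physics"),
])

seeds = {"Marie Curie": 1.0}  # restart mass on entities found in the query
scores = nx.pagerank(G, alpha=0.85, personalization=seeds)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```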

[297] Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning

Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu

Main category: cs.AI

TL;DR: Depth utilization in LLMs is heterogeneous - shallow layers handle knowledge/retrieval while deeper layers enable reasoning and coherence, with effectiveness varying by evaluation method.

DetailsMotivation: To systematically investigate whether deep layers in LLMs are truly redundant, challenging narrow evaluations that suggest they can be removed without performance loss.

Method: Comprehensive analysis across diverse dimensions including evaluation protocols (likelihood vs generation), task categories, and model architectures to study depth utilization patterns.

Result: Shallow layers are critical for knowledge and retrieval, while middle and deeper layers are indispensable for reasoning and long-range coherence. Performance preservation depends on the evaluation method: likelihood metrics suggest most layers can be pruned, but generation reveals the importance of deeper layers.

Conclusion: Depth usage in LLMs is highly context-dependent, requiring task-, metric-, and model-aware perspectives for both interpretation and compression of large models.

Abstract: Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers – yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.

[298] Think Then Embed: Generative Context Improves Multimodal Embedding

Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Yonghuan Yang, Jun Xiao, Qi Guo, Ser-Nam Lim, Aashu Singh, Xiangjun Fan

Main category: cs.AI

TL;DR: TTE framework improves Universal Multimodal Embeddings by adding explicit reasoning through MLLMs before embedding, achieving SOTA performance on MMEB-V2 benchmark.

DetailsMotivation: Current MLLM-based UME approaches treat models solely as encoders, ignoring generative capacity and struggling with complex compositional reasoning tasks.

Method: Think-Then-Embed framework with two components: a reasoner MLLM that generates reasoning traces for complex queries, followed by an embedder that produces representations conditioned on both original query and intermediate reasoning.

Result: Achieved state-of-the-art on MMEB-V2 benchmark, surpassing proprietary models. Also created smaller MLLM reasoner with 7% absolute gain over recent open-source models, and developed unified model for efficiency.

Conclusion: Explicit reasoning step enables better understanding of complex multimodal instructions, and the framework can be optimized for efficiency while maintaining performance.

Abstract: There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.

[299] Earth AI: Unlocking Geospatial Insights with Foundation Models and Cross-Modal Reasoning

Aaron Bell, Amit Aides, Amr Helmy, Arbaaz Muslim, Aviad Barzilai, Aviv Slobodkin, Bolous Jaber, David Schottlander, George Leifman, Joydeep Paul, Mimi Sun, Nadav Sherman, Natalie Williams, Per Bjornsson, Roy Lee, Ruth Alcantara, Thomas Turnbull, Tomer Shekel, Vered Silverman, Yotam Gigi, Adam Boulanger, Alex Ottenwess, Ali Ahmadalipour, Anna Carter, Behzad Vahedi, Charles Elliott, David Andre, Elad Aharoni, Gia Jung, Hassler Thurston, Jacob Bien, Jamie McPike, Juliet Rothenberg, Kartik Hegde, Kel Markert, Kim Philipp Jablonski, Luc Houriez, Monica Bharel, Phing VanLee, Reuven Sayag, Sebastian Pilarski, Shelley Cazares, Shlomi Pasternak, Siduo Jiang, Thomas Colthurst, Yang Chen, Yehonathan Refael, Yochai Blau, Yuval Carny, Yael Maguire, Avinatan Hassidim, James Manyika, Tim Thelin, Genady Beryozkin, Gautam Prasad, Luke Barrington, Yossi Matias, Niv Efron, Shravya Shetty

Main category: cs.AI

TL;DR: Earth AI introduces geospatial AI models and agentic reasoning to analyze diverse geospatial data, using foundation models across Planet-scale Imagery, Population, and Environment domains with Gemini-powered reasoning for actionable insights.

DetailsMotivation: The immense volume and diversity of geospatial data with varied resolutions, timescales, and sparsity pose significant challenges for thorough analysis and interpretation of our planet.

Method: Built foundation models across three domains (Planet-scale Imagery, Population, Environment) and developed a Gemini-powered agent that jointly reasons over multiple foundation models, geospatial data sources, and tools.

Result: Rigorous benchmarks show superior predictive capabilities when foundation models are used together. The agent delivers critical insights on real-world crisis scenarios, bridging the gap between raw data and actionable understanding.

Conclusion: Earth AI enables significant advances in unlocking novel and profound insights into our planet through synergistic geospatial AI models and intelligent agentic reasoning.

Abstract: Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data along with its varied resolutions, timescales, and sparsity pose significant challenges for thorough analysis and interpretation. This paper introduces Earth AI, a family of geospatial AI models and agentic reasoning that enables significant advances in our ability to unlock novel and profound insights into our planet. This approach is built upon foundation models across three key domains–Planet-scale Imagery, Population, and Environment–and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that when used together, they provide complementary value for geospatial inference and their synergies unlock superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.

[300] LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources

Haichao Ji, Zibo Wang, Cheng Pan, Meng Han, Yifei Zhu, Dan Wang, Zhu Han

Main category: cs.AI

TL;DR: LAFA integrates LLM-agent-based data analytics with federated analytics to enable privacy-preserving natural language queries on distributed data sources.

DetailsMotivation: Existing LLM-agent analytics frameworks lack privacy protection by assuming centralized data access, while federated analytics preserves privacy but requires structured queries instead of natural language input.

Method: LAFA uses a hierarchical multi-agent architecture with a coarse-grained planner to decompose queries into sub-queries, a fine-grained planner to map subqueries to FA operation DAGs, and an optimizer agent to rewrite and merge DAGs for efficiency.

Result: LAFA outperforms baseline prompting strategies with higher execution plan success rates and substantially reduces resource-intensive FA operations.

Conclusion: LAFA establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in federated analytics settings.

Abstract: Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse-grained planner first decomposes complex queries into sub-queries, while a fine-grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting.

[301] Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents

Gil Pasternak, Dheeraj Rajagopal, Julia White, Dhruv Atreja, Matthew Thomas, George Hurn-Maloney, Ash Lewis

Main category: cs.AI

TL;DR: PROBE is a new benchmark that evaluates LLM-based agents’ proactivity through a three-step pipeline: searching for unspecified issues, identifying bottlenecks, and executing resolutions. Current state-of-the-art models struggle with this benchmark, with the best achieving only 40% end-to-end performance.

DetailsMotivation: Current benchmarks for evaluating proactive LLM agents are limited to localized contexts and cannot test reasoning across multiple sources and longer time horizons, creating a gap in comprehensive proactivity assessment.

Method: PROBE decomposes proactivity into three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. This pipeline approach is applied to evaluate leading LLMs and agentic frameworks.

Result: Even state-of-the-art models struggle with PROBE, with the best end-to-end performance of 40% achieved by both GPT-5 and Claude Opus-4.1. The study reveals mutual failure modes and relative capabilities across different models.

Conclusion: The results highlight current limitations in autonomous action for agentic systems and expose promising future research directions for improving proactive capabilities in LLM-based agents.

Abstract: LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity as a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Using consistent measurements across frontier LLMs and agents, we find that the best end-to-end performance of 40% is achieved by both GPT-5 and Claude Opus-4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems, and expose promising future research directions.

[302] Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs

Kai Zhuang, Jiawei Zhang, Yumou Liu, Hanqun Cao, Chunbin Gu, Mengdi Liu, Zhangyang Gao, Zitong Jerry Wang, Xuanhe Zhou, Pheng-Ann Heng, Lijun Wu, Conghui He, Cheng Tan

Main category: cs.AI

TL;DR: Sci-LLMs perform better when given high-level structured context from bioinformatics tools rather than raw biomolecular sequences, as sequences act as noise that degrades performance.

DetailsMotivation: Current Sci-LLMs face a tokenization dilemma when processing raw biomolecular sequences - either treating sequences as specialized language (risking loss of functional motifs) or as separate modality (creating alignment challenges), limiting reasoning capacity.

Method: Systematic comparison of leading Sci-LLMs on biological reasoning tasks using three input modes: sequence-only, context-only, and combination of both.

Result: Context-only approach consistently and substantially outperforms all other modes. Including raw sequences alongside context degrades performance, indicating sequences act as informational noise even for specialized models.

Conclusion: Sci-LLMs should be reframed as powerful reasoning engines over expert knowledge rather than sequence decoders, laying foundation for hybrid scientific AI agents focused on high-level knowledge synthesis.

Abstract: Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of functional motif information, or as a separate modality, introducing formidable alignment challenges, current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, the inclusion of the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at https://github.com/opendatalab-raiser/CoKE.

[303] Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach

Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Michael Jones, Linus Gisslén

Main category: cs.AI

TL;DR: A sample-efficient DRL method for training human-like game AI agents that improves training speed by 50% and outperforms built-in game AI by 10% in ball saving rate, adopted in EA SPORTS FC 25.

DetailsMotivation: DRL has rarely been used in game industry for authentic AI behaviors due to impractical large models and super-human agents. Game studios need human-like agents with limited resources.

Method: Sample-efficient DRL method that improves value-based DRL by leveraging pre-collected data and increasing network plasticity, tailored for industrial game development settings.

Result: Goalkeeper agent in EA SPORTS FC 25 outperforms built-in AI by 10% in ball saving rate, trains 50% faster than standard DRL methods, and creates more human-like gameplay according to domain experts.

Conclusion: The method successfully addresses practical game industry needs for human-like AI and has been adopted in the latest EA SPORTS FC release, demonstrating real-world impact.

Abstract: While several high-profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method by training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game’s built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testament to the impact of the approach, the method has been adopted for use in the most recent release of the series.
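
A sketch of two ingredients named in the method: seeding a value-based learner with pre-collected data, and periodically restoring network plasticity. Resetting the output head is one common plasticity technique; the paper's exact mechanism is not specified in the summary, and all names and schedules below are illustrative.

```python
# Sketch under assumptions: pre-filled replay buffer + periodic head resets
# as a plasticity refresh for a value-based DRL agent.
import random
from collections import deque

import torch
import torch.nn as nn

def make_qnet(obs_dim: int, n_actions: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

qnet = make_qnet(obs_dim=8, n_actions=4)
buffer = deque(maxlen=100_000)

# 1) Seed the replay buffer with pre-collected transitions (e.g., gameplay
#    logged from the built-in AI) before any online interaction.
precollected = [(torch.randn(8), random.randrange(4), 0.0, torch.randn(8))
                for _ in range(1000)]
buffer.extend(precollected)

# 2) Periodically re-initialize the final layer to restore plasticity that
#    tends to decay over the course of value learning.
def reset_head(net: nn.Sequential) -> None:
    head = net[-1]
    nn.init.kaiming_uniform_(head.weight)
    nn.init.zeros_(head.bias)

for step in range(1, 10_001):
    if step % 2_500 == 0:
        reset_head(qnet)  # illustrative reset schedule
```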

[304] Multi-Agent Evolve: LLM Self-Improve through Co-evolution

Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, Jiaxuan You

Main category: cs.AI

TL;DR: MAE is a multi-agent self-evolution framework that uses three interacting agents (Proposer, Solver, Judge) from a single LLM to enhance reasoning capabilities without human supervision, achieving 4.54% average improvement on benchmarks.

DetailsMotivation: Current RL methods for LLMs rely heavily on human-curated datasets and verifiable rewards, limiting scalability. Self-Play RL methods require grounded environments, making generalization challenging.

Method: MAE uses a triplet agent system (Proposer, Solver, Judge) instantiated from one LLM. The Proposer generates questions, Solver provides solutions, and Judge evaluates both, with all agents co-evolving through reinforcement learning.

Result: Experiments on Qwen2.5-3B-Instruct show MAE achieves 4.54% average improvement across multiple benchmarks in mathematics, reasoning, and general knowledge Q&A.

Conclusion: MAE provides a scalable, data-efficient method for enhancing LLM reasoning abilities with minimal human supervision, demonstrating strong performance across diverse tasks.

Abstract: Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human-curated datasets and verifiable rewards, which limit their scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, their methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is based on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
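
A schematic of the Proposer/Solver/Judge loop, with all three roles served by the same model as the abstract describes. The `llm` placeholder, the role prompts, and the elided reward shaping and RL update are all assumptions for illustration.

```python
# One co-evolution step of the triplet agent system in MAE (schematic).

def llm(prompt: str) -> str:
    """Placeholder for the single shared LLM; all three roles call this."""
    return f"<response to: {prompt[:40]}...>"

def evolve_step(topic: str) -> dict:
    # Proposer: generate a new question for the Solver.
    question = llm(f"As Proposer, write a challenging {topic} question.")
    # Solver: attempt a solution to the proposed question.
    solution = llm(f"As Solver, answer step by step: {question}")
    # Judge: score both the question's quality and the solution's correctness.
    verdict = llm(f"As Judge, rate the question {question!r} and answer {solution!r}.")
    # In MAE these signals drive RL updates so all three roles co-evolve.
    return {"question": question, "solution": solution, "verdict": verdict}

print(evolve_step("number theory"))
```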

[305] Latent Chain-of-Thought for Visual Reasoning

Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao

Main category: cs.AI

TL;DR: The paper proposes a novel training algorithm for Large Vision-Language Models that reformulates reasoning as posterior inference using amortized variational inference and diversity-seeking reinforcement learning to improve CoT reasoning across diverse tasks.

DetailsMotivation: Existing training algorithms like SFT, PPO, and GRPO don't generalize well across unseen reasoning tasks and rely heavily on biased reward models, limiting the effectiveness and reliability of Chain-of-thought reasoning in LVLMs.

Method: Reformulates reasoning as posterior inference using amortized variational inference, introduces sparse reward functions for token-level learning signals that encourage diverse latent CoT, and implements Bayesian inference-scaling strategy with marginal likelihood for efficient rationale ranking.

Result: Empirically demonstrates enhanced performance on seven reasoning benchmarks, improving state-of-the-art LVLMs in effectiveness, generalization, and interpretability.

Conclusion: The proposed method successfully addresses limitations of existing training approaches by leveraging variational inference and diversity-seeking reinforcement learning, leading to more robust and interpretable reasoning capabilities in LVLMs.

Abstract: Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
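
A sketch of the posterior-inference view in standard amortized variational inference notation, where z is the latent rationale (CoT), x the multimodal input, and y the answer. This is the generic evidence lower bound implied by that reformulation, not the paper's exact objective.

```latex
% Generic ELBO for latent-rationale inference; q_\phi is the amortized posterior.
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(y, z \mid x)
  \ge \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log p_\theta(y \mid x, z) \right]
    - \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x) \right)
```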

cs.SD

[306] ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models

Weifei Jin, Yuxin Cao, Junjie Su, Minhui Xue, Jie Hao, Ke Xu, Jin Song Dong, Derui Wang

Main category: cs.SD

TL;DR: ALMGuard is the first defense framework specifically designed for Audio-Language Models (ALMs) that uses Shortcut Activation Perturbations (SAPs) and Mel-Gradient Sparse Mask (M-GSM) to protect against jailbreak attacks while maintaining utility on benign tasks.

DetailsMotivation: Existing defenses from traditional audio adversarial attacks or text-based LLM jailbreaks are ineffective against ALM-specific threats, creating a need for specialized protection for Audio-Language Models.

Method: The approach identifies universal Shortcut Activation Perturbations (SAPs) that activate safety shortcuts in ALMs, and uses Mel-Gradient Sparse Mask (M-GSM) to restrict perturbations to Mel-frequency bins sensitive to jailbreaks but not to speech understanding.

Result: ALMGuard reduces the average success rate of advanced ALM-specific jailbreak attacks to 4.6% across four models while maintaining comparable utility on benign benchmarks.

Conclusion: ALMGuard establishes a new state-of-the-art defense for Audio-Language Models, providing robust protection against both seen and unseen jailbreak attacks without compromising model utility.

Abstract: Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also brings new and unique vulnerability vectors. Previous studies have proposed jailbreak attacks that specifically target ALMs, revealing that defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose ALMGuard, the first defense framework tailored to ALMs. Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, we design a method to identify universal Shortcut Activation Perturbations (SAPs) that serve as triggers that activate the safety shortcuts to safeguard ALMs at inference time. To better sift out effective triggers while preserving the model’s utility on benign tasks, we further propose Mel-Gradient Sparse Mask (M-GSM), which restricts perturbations to Mel-frequency bins that are sensitive to jailbreaks but insensitive to speech understanding. Both theoretical analyses and empirical results demonstrate the robustness of our method against both seen and unseen attacks. Overall, ALMGuard reduces the average success rate of advanced ALM-specific jailbreak attacks to 4.6% across four models, while maintaining comparable utility on benign benchmarks, establishing it as the new state of the art. Our code and data are available at https://github.com/WeifeiJin/ALMGuard.
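
A sketch of the masking idea behind M-GSM: a universal perturbation is confined to mel-frequency bins flagged as jailbreak-sensitive but speech-insensitive. The random bin selection below is a placeholder; in ALMGuard the mask comes from gradient analysis, and all shapes and scales here are illustrative.

```python
# Confine a candidate perturbation to a sparse set of mel bins (schematic).
import numpy as np

n_mels, n_frames = 80, 300
rng = np.random.default_rng(0)

# Placeholder mask: True where a mel bin may carry perturbation energy.
sensitive_bins = rng.random(n_mels) < 0.15          # ~15% of bins, illustrative
mask = np.broadcast_to(sensitive_bins[:, None], (n_mels, n_frames))

# A candidate Shortcut Activation Perturbation in the mel domain, masked so
# that only the selected bins are modified.
sap = rng.normal(scale=0.01, size=(n_mels, n_frames))
masked_sap = np.where(mask, sap, 0.0)

mel_spectrogram = rng.random((n_mels, n_frames))    # stand-in for real features
protected = mel_spectrogram + masked_sap
print(f"perturbed bins: {sensitive_bins.sum()}/{n_mels}")
```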

[307] SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

Hitomi Jin Ling Tee, Chaoren Wang, Zijie Zhang, Zhizheng Wu

Main category: cs.SD

TL;DR: Proposes Spoken-Passage Multiple-Choice Question Answering (SP-MCQA) as a novel subjective evaluation method for TTS intelligibility, revealing that low WER doesn’t guarantee high key-information accuracy.

DetailsMotivation: Existing TTS intelligibility assessments rely too heavily on word-by-word accuracy metrics like WER, which fail to capture real-world speech complexity and human comprehension needs.

Method: Introduces SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for evaluating key information accuracy in synthesized speech through multiple-choice question answering.

Result: Experiments show low WER doesn’t guarantee high key-information accuracy, exposing gaps between traditional metrics and practical intelligibility. SOTA models still lack robust text normalization and phonetic accuracy.

Conclusion: Urgent need for high-level, life-like evaluation criteria since many systems excel at WER but fall short on real-world intelligibility.

Abstract: The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.

[308] Modeling strategies for speech enhancement in the latent space of a neural audio codec

Sofiene Kammoun, Xavier Alameda-Pineda, Simon Leglaive

Main category: cs.SD

TL;DR: This paper compares continuous vs discrete neural audio codec representations for speech enhancement, finding continuous representations outperform discrete ones, non-autoregressive models are more practical than autoregressive ones, and encoder fine-tuning gives best enhancement metrics but harms codec reconstruction.

DetailsMotivation: To investigate how continuous vector representations compare to discrete token representations from neural audio codecs when used as training targets for supervised speech enhancement.

Method: Used autoregressive and non-autoregressive Conformer-based speech enhancement models, plus a baseline of fine-tuning the NAC encoder directly for enhancement. Compared continuous latent representations vs discrete token prediction.

Result: Continuous latent representations consistently outperformed discrete token prediction; autoregressive models achieved higher quality but hurt intelligibility and efficiency; encoder fine-tuning yielded strongest enhancement metrics overall but degraded codec reconstruction.

Conclusion: Continuous representations are superior to discrete ones for speech enhancement, non-autoregressive models are more practical than autoregressive ones, and encoder fine-tuning provides the best enhancement performance despite compromising codec reconstruction quality.

Abstract: Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline where the NAC encoder is simply fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.
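
A sketch contrasting the two training targets compared in the paper: regressing continuous NAC latents versus classifying discrete codec tokens. The tensor standing in for the enhancement model's output and all dimensions are placeholders for the paper's Conformer-based setup.

```python
# Two supervision styles for latent-space speech enhancement (schematic).
import torch
import torch.nn as nn

batch, frames, latent_dim, vocab = 4, 100, 128, 1024
enhancer_out = torch.randn(batch, frames, latent_dim)  # enhancement model output

# (a) Continuous target: predict the clean-speech latent vectors directly.
clean_latents = torch.randn(batch, frames, latent_dim)
continuous_loss = nn.functional.mse_loss(enhancer_out, clean_latents)

# (b) Discrete target: project to token logits and classify the clean tokens.
to_logits = nn.Linear(latent_dim, vocab)
clean_tokens = torch.randint(0, vocab, (batch, frames))
discrete_loss = nn.functional.cross_entropy(
    to_logits(enhancer_out).reshape(-1, vocab), clean_tokens.reshape(-1)
)
print(f"continuous: {continuous_loss.item():.3f}, discrete: {discrete_loss.item():.3f}")
```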

[309] UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens

Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Yinghao Liu, Zheng Xue, Gang Song, Boyang Zhou

Main category: cs.SD

TL;DR: UniTok-Audio is a unified framework for multiple audio generation tasks that uses continuous feature extraction, discrete token generation, and a dual-stream audio codec to achieve competitive performance across five time-aligned tasks.

DetailsMotivation: Current audio generation models face challenges in audio quality and generalization across tasks, leading to redundant development efforts, inconsistent performance, and limited extensibility.

Method: 1) Extracts continuous features of conditions to generate discrete tokens autoregressively; 2) Uses special task identifier tokens to unify different learning patterns; 3) Develops a dual-stream audio codec with acoustic and semantic branches for high-fidelity waveform reconstruction.

Result: Achieves competitive performance compared to state-of-the-art task-specific or multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation.

Conclusion: UniTok-Audio provides a scalable and extensible framework that successfully addresses fragmentation in audio generation models while maintaining high performance across multiple tasks.

Abstract: Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok-Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous features of conditions to generate discrete tokens of target audio in an autoregressive manner; 2) a special task identifier token unifies different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec involving acoustic and semantic branches is developed for high-fidelity waveform reconstruction. Experimental results demonstrate that UniTok-Audio achieves competitive performance in comparison with state-of-the-art task-specific or multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation. To foster future research, we will open-source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified-audio.

[310] ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation

Jiatong Shi, Yifan Cheng, Bo-Hao Su, Hye-jin Shim, Jinchuan Tian, Samuele Cornell, Yiwen Zhao, Siddhant Arora, Shinji Watanabe

Main category: cs.SD

TL;DR: ARECHO is an autoregressive speech evaluation system that uses chain-based dependency modeling to jointly predict multiple speech quality metrics like PESQ, STOI, and MOS, addressing their different scales and dependencies through dynamic classifier chains and confidence-oriented decoding.

DetailsMotivation: Speech quality evaluation involves predicting multiple perceptual and objective metrics that have different scales, assumptions, and dependencies, making joint estimation challenging. Existing approaches struggle with these inter-metric relationships and scale differences.

Method: ARECHO employs three key innovations: (1) comprehensive speech information tokenization pipeline, (2) dynamic classifier chain that explicitly captures inter-metric dependencies, and (3) two-step confidence-oriented decoding algorithm for enhanced inference reliability.

Result: Experiments show ARECHO significantly outperforms baseline frameworks across diverse evaluation scenarios including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. It also improves interpretability by capturing inter-metric relationships.

Conclusion: ARECHO provides a versatile, reference-free speech evaluation system that supports subset queries for single or multiple metrics, reduces error propagation through confidence-oriented decoding, and offers improved performance across various speech assessment tasks.

Abstract: Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. However, these metrics often have different scales, assumptions, and dependencies, making joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a chain-based, versatile evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships. Across tasks, ARECHO offers reference-free evaluation using its dynamic classifier chain to support subset queries (single or multiple metrics) and reduces error propagation via confidence-oriented decoding.

[311] Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient

Rinku Sebastian, Simon O’Keefe, Martin Trefzer

Main category: cs.SD

TL;DR: Proposes TMFWC, a time-domain method combining MFCC and wavelet transform advantages while reducing computational complexity, and shows improved efficiency when combined with reservoir computing.

DetailsMotivation: MFCC lacks temporal information while wavelet transform has poor frequency resolution at low frequencies and doesn't align well with human auditory perception. Need a feature that combines both advantages.

Method: Time domain Mel frequency Wavelet Coefficient (TMFWC) - extracts Mel scale features in time domain by combining wavelet transform concepts, avoiding time-frequency conversion and reducing wavelet extraction complexity.

Result: Significantly improved efficiency of audio signal processing when combined with reservoir computing methodology.

Conclusion: TMFWC successfully incorporates MFCC and wavelet transform merits while reducing computational burden, making it an effective feature extraction method for speech processing.

Abstract: Extracting features from speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of speaker and speech recognition applications, as the filtering in this feature is similar to the filtering taking place in the human ear. But the main drawback of this feature is that it provides only the frequency information of the signal and not the information about when each frequency is present. The wavelet transform, with its flexible time-frequency window, provides time and frequency information of the signal and is an appropriate tool for the analysis of non-stationary signals like speech. On the other hand, because of its uniform frequency scaling, a typical wavelet transform may be less effective in analysing speech signals, have poorer frequency resolution in low frequencies, and be less in line with human auditory perception. Hence, it is necessary to develop a feature that incorporates the merits of both MFCC and the wavelet transform, and a great deal of studies have tried to combine them. Existing wavelet-transform-based Mel-scaled feature extraction methods require more computation because a wavelet transform is applied on top of Mel-scale filtering, which adds extra processing steps. Here we propose a method to extract Mel-scale features in the time domain by combining the concept of the wavelet transform, thus reducing the computational burden of time-frequency conversion and the complexity of wavelet extraction. Combining our proposed Time domain Mel frequency Wavelet Coefficient (TMFWC) technique with the reservoir computing methodology has significantly improved the efficiency of audio signal processing.

cs.LG

[312] A Practitioner’s Guide to Kolmogorov-Arnold Networks

Amir Noorizadegan, Sifan Wang, Leevan Ling

Main category: cs.LG

TL;DR: A comprehensive review of Kolmogorov-Arnold Networks (KANs) covering theoretical foundations, architectural variants, implementation strategies, and practical guidance for practitioners.

DetailsMotivation: To provide a systematic overview of the rapidly expanding KAN landscape, bridging the conceptual gap between KANs and MLPs, and mapping the vibrant ecosystem supporting KAN development.

Method: Systematic collection and categorization of open-source implementations, analysis of basis function choices (B-splines, polynomials, ReLU, Gaussian RBFs, Fourier series), and synthesis of techniques for accuracy, efficiency, and regularization.

Result: Established formal equivalence between KANs and MLPs, highlighted KAN’s superior parameter efficiency, and created a practical guide for selecting appropriate KAN architectures.

Conclusion: Identified current research gaps and provided a structured reference for ongoing KAN research through an associated GitHub repository.

Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional Multilayer Perceptrons (MLPs), inspired by the Kolmogorov-Arnold representation theorem. Unlike MLPs, which use fixed activation functions on nodes, KANs employ learnable univariate basis functions on edges, offering enhanced expressivity and interpretability. This review provides a systematic and comprehensive overview of the rapidly expanding KAN landscape, moving beyond simple performance comparisons to offer a structured synthesis of theoretical foundations, architectural variants, and practical implementation strategies. By collecting and categorizing a vast array of open-source implementations, we map the vibrant ecosystem supporting KAN development. We begin by bridging the conceptual gap between KANs and MLPs, establishing their formal equivalence and highlighting the superior parameter efficiency of the KAN formulation. A central theme of our review is the critical role of the basis function; we survey a wide array of choices, including B-splines, Chebyshev and Jacobi polynomials, ReLU compositions, Gaussian RBFs, and Fourier series, and analyze their respective trade-offs in terms of smoothness, locality, and computational cost. We then categorize recent advancements into a clear roadmap, covering techniques for improving accuracy, efficiency, and regularization. Key topics include physics-informed loss design, adaptive sampling, domain decomposition, hybrid architectures, and specialized methods for handling discontinuities. Finally, we provide a practical “Choose-Your-KAN” guide to help practitioners select appropriate architectures, and we conclude by identifying current research gaps. The associated GitHub repository https://github.com/AmirNoori68/kan-review complements this paper and serves as a structured reference for ongoing KAN research.
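
A minimal KAN-style layer to make the core idea concrete: each edge carries a learnable univariate function, here parameterized with a Gaussian RBF basis, one of the basis choices surveyed in the review. The grid, width heuristic, and dimensions are illustrative, not a reference implementation.

```python
# Minimal KAN layer: output_j = sum_i f_ij(x_i), with each f_ij a learnable
# combination of fixed Gaussian RBF basis functions on [-1, 1].
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, n_basis: int = 8):
        super().__init__()
        # Fixed RBF centers; learnable per-edge basis coefficients.
        self.register_buffer("centers", torch.linspace(-1, 1, n_basis))
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_basis) * 0.1)
        self.gamma = 2.0 * n_basis  # RBF width scale (heuristic)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) -> RBF features: (batch, in_dim, n_basis)
        phi = torch.exp(-self.gamma * (x.unsqueeze(-1) - self.centers) ** 2)
        # Sum each edge's univariate function over inputs: (batch, out_dim)
        return torch.einsum("bif,iof->bo", phi, self.coef)

layer = RBFKANLayer(in_dim=3, out_dim=2)
print(layer(torch.rand(5, 3)).shape)  # torch.Size([5, 2])
```

Swapping the RBF features for B-spline or Chebyshev evaluations changes only the `phi` computation, which is exactly the basis-function trade-off the review analyzes.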

[313] Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders

Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow

Main category: cs.LG

TL;DR: A framework for interpreting audio generative models by mapping latent representations to human-interpretable acoustic concepts using sparse autoencoders and linear mappings to acoustic properties.

DetailsMotivation: Sparse autoencoders work well for language models but face challenges in audio generation due to audio's dense nature obscuring semantic meaning and limited automatic feature characterization.

Method: Train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, timbre). Validate on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces.

Result: Enables controllable manipulation and analysis of AI music generation process, revealing how acoustic properties emerge during synthesis. Successfully analyzed DiffRhythm model to show evolution of pitch, timbre, and loudness throughout generation.

Conclusion: The framework provides interpretable analysis of audio generative models and can be extended to visual latent space generation models.

Abstract: While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio’s dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. While our work is limited to the audio modality, our framework can be extended to interpretable analysis of visual latent space generation models.
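
A sketch of the two-stage pipeline: a sparse autoencoder trained on audio-codec latents, then a linear map from SAE features to a discretized acoustic property. Dimensions, the L1 coefficient, and the pitch-bin labels are illustrative placeholders.

```python
# Stage 1: SAE on codec latents. Stage 2: linear probe to acoustic properties.
import torch
import torch.nn as nn

latent_dim, n_features, n_pitch_bins = 128, 1024, 64

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(latent_dim, n_features)
        self.dec = nn.Linear(n_features, latent_dim)

    def forward(self, z):
        f = torch.relu(self.enc(z))            # non-negative feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
z = torch.randn(256, latent_dim)               # stand-in for codec latents
recon, feats = sae(z)
l1_coeff = 1e-3                                # sparsity strength (assumed)
sae_loss = nn.functional.mse_loss(recon, z) + l1_coeff * feats.abs().mean()

# Linear probe: map SAE features to a discretized acoustic property (pitch bin).
probe = nn.Linear(n_features, n_pitch_bins)
pitch_bins = torch.randint(0, n_pitch_bins, (256,))  # placeholder labels
probe_loss = nn.functional.cross_entropy(probe(feats.detach()), pitch_bins)
print(sae_loss.item(), probe_loss.item())
```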

[314] HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series

Simon A. Lee, Cyrus Tanade, Hao Zhou, Juhyeon Lee, Megha Thukral, Minji Han, Rachel Choi, Md Sazzad Hissain Khan, Baiying Lu, Migyeong Gwak, Mehrab Bin Morshed, Viswam Nathan, Md Mahbubur Rahman, Li Zhu, Subramaniam Venkatraman, Sharanya Arcot Desai

Main category: cs.LG

TL;DR: HiMAE is a hierarchical masked autoencoder that learns multi-resolution embeddings from wearable sensor data, outperforming state-of-the-art models while being compact enough for edge inference on smartwatches.

DetailsMotivation: To understand how temporal resolution affects predictive utility in wearable sensor data, hypothesizing that different clinical outcomes rely on structure at distinct temporal scales.

Method: HiMAE combines masked autoencoding with a hierarchical convolutional encoder-decoder to produce multi-resolution embeddings, treating temporal resolution as a probe for interpretability rather than a hyperparameter.

Result: HiMAE consistently outperforms state-of-the-art foundation models across classification, regression, and generative benchmarks while being orders of magnitude smaller and achieving sub-millisecond inference on smartwatch CPUs.

Conclusion: HiMAE serves as both an efficient self-supervised learning method and a discovery tool for identifying scale-sensitive structure in wearable health data, enabling true edge inference.

Abstract: Wearable sensors provide abundant physiological time series, yet the principles governing their predictive utility remain unclear. We hypothesize that temporal resolution is a fundamental axis of representation learning, with different clinical and behavioral outcomes relying on structure at distinct scales. To test this resolution hypothesis, we introduce HiMAE (Hierarchical Masked Autoencoder), a self-supervised framework that combines masked autoencoding with a hierarchical convolutional encoder-decoder. HiMAE produces multi-resolution embeddings that enable systematic evaluation of which temporal scales carry predictive signal, transforming resolution from a hyperparameter into a probe for interpretability. Across classification, regression, and generative benchmarks, HiMAE consistently outperforms state-of-the-art foundation models that collapse scale, while being orders of magnitude smaller. HiMAE is an efficient representation learner compact enough to run entirely on-watch, achieving sub-millisecond inference on smartwatch-class CPUs for true edge inference. Together, these contributions position HiMAE as both an efficient self-supervised learning method and a discovery tool for scale-sensitive structure in wearable health.

[315] SHA-256 Infused Embedding-Driven Generative Modeling of High-Energy Molecules in Low-Data Regimes

Siddharth Verma, Alankar Alankar

Main category: cs.LG

TL;DR: A novel approach combining LSTM networks for molecular generation and Attentive GNNs for property prediction to discover high-energy materials, achieving 67.5% validity and 37.5% novelty, with 37 new super explosives identified.

DetailsMotivation: High-energy materials discovery is constrained by limited experimental data and restricted access to testing facilities, requiring computational approaches to accelerate discovery.

Method: Combines LSTM networks for molecular generation with Attentive Graph Neural Networks for property prediction, using a novel embedding strategy that integrates fixed SHA-256 embeddings with partially trainable representations.

Result: Achieved 67.5% validity and 37.5% novelty in generated molecules, with mean Tanimoto coefficient of 0.214 indicating diverse chemical space. Identified 37 new super explosives with predicted detonation velocity higher than 9 km/s.

Conclusion: The proposed framework successfully generates diverse and novel high-energy molecules, demonstrating potential for accelerating discovery of super explosives without requiring pretraining.

Abstract: High-energy materials (HEMs) are critical for propulsion and defense domains, yet their discovery remains constrained by limited experimental data and restricted access to testing facilities. This work presents a novel approach toward high-energy molecules by combining Long Short-Term Memory (LSTM) networks for molecular generation and Attentive Graph Neural Networks (GNN) for property predictions. We propose a transformative embedding space construction strategy that integrates fixed SHA-256 embeddings with partially trainable representations. Unlike conventional regularization techniques, this changes the representational basis itself, reshaping the molecular input space before learning begins. Without recourse to pretraining, the generator achieves 67.5% validity and 37.5% novelty. The generated library exhibits a mean Tanimoto coefficient of 0.214 relative to the training set, signifying the ability of the framework to generate a diverse chemical space. We identified 37 new super explosives with predicted detonation velocities above 9 km/s.
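
A sketch of the embedding construction described: a fixed vector derived from each token's SHA-256 digest, concatenated with a trainable embedding. The hash-to-vector normalization, the dimensions, and the toy SMILES vocabulary are assumptions for illustration.

```python
# Fixed SHA-256-derived embeddings concatenated with trainable ones (schematic).
import hashlib

import torch
import torch.nn as nn

def sha256_embedding(token: str, dim: int = 32) -> torch.Tensor:
    """Deterministic, fixed embedding from the token's SHA-256 digest."""
    digest = hashlib.sha256(token.encode()).digest()  # 32 bytes
    vals = torch.tensor(list(digest[:dim]), dtype=torch.float32)
    return vals / 255.0 * 2 - 1                       # map bytes to [-1, 1]

class HybridEmbedding(nn.Module):
    def __init__(self, vocab: list, trainable_dim: int = 32):
        super().__init__()
        self.vocab = {t: i for i, t in enumerate(vocab)}
        fixed = torch.stack([sha256_embedding(t) for t in vocab])
        self.register_buffer("fixed", fixed)           # frozen SHA-256 part
        self.trainable = nn.Embedding(len(vocab), trainable_dim)

    def forward(self, tokens: list) -> torch.Tensor:
        idx = torch.tensor([self.vocab[t] for t in tokens])
        return torch.cat([self.fixed[idx], self.trainable(idx)], dim=-1)

emb = HybridEmbedding(vocab=["C", "N", "O", "(", ")", "=", "1"])
print(emb(["C", "(", "=", "O", ")"]).shape)  # torch.Size([5, 64])
```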

[316] The Kinetics of Reasoning: How Chain-of-Thought Shapes Learning in Transformers?

Zihan Pengmei, Costas Mavromatis, Zhengyuan Shen, Yunyi Zhang, Vassilis N. Ioannidis, Huzefa Rangwala

Main category: cs.LG

TL;DR: CoT supervision improves transformer performance but depends on task complexity. Models show transient trace unfaithfulness before aligning reasoning with answers. CoT accelerates generalization but doesn’t overcome high algorithmic complexity tasks.

DetailsMotivation: To understand how transformers learn to follow and benefit from chain-of-thought (CoT) supervision, particularly the learning dynamics and mechanisms behind CoT's effectiveness.

Method: Pretrained transformers on symbolic reasoning tasks with tunable complexity and controllable data composition. Compared two settings: final answers only vs. CoT traces before answers. Used three-parameter logistic curve modeling to quantify learning dynamics.

Result: CoT generally improves performance but benefits depend on task complexity. Models exhibit transient trace unfaithfulness early in training. CoT accelerates generalization but cannot overcome high algorithmic complexity tasks like list intersections.

Conclusion: CoT supervision accelerates transformer learning and alters internal computation mechanisms, but its effectiveness is bounded by task complexity. Trace faithfulness emerges dynamically during training, and kinetic modeling provides insights into transformer learning dynamics.

Abstract: Chain-of-thought (CoT) supervision can substantially improve transformer performance, yet the mechanisms by which models learn to follow and benefit from CoT remain poorly understood. We investigate these learning dynamics through the lens of grokking by pretraining transformers on symbolic reasoning tasks with tunable algorithmic complexity and controllable data composition to study their generalization. Models were trained under two settings: (i) producing only final answers, and (ii) emitting explicit CoT traces before answering. Our results show that while CoT generally improves task performance, its benefits depend on task complexity. To quantify these effects, we model accuracy as a function of the logarithm of training steps with a three-parameter logistic curve, revealing how the learning speed and shape vary with task complexity, data distribution, and the presence of CoT supervision. We also uncover a transient trace unfaithfulness phase: early in training, models often produce correct answers while skipping or contradicting CoT steps, before later aligning their reasoning traces with answers. Empirically, we (1) demonstrate that CoT accelerates generalization but does not overcome tasks with higher algorithmic complexity, such as finding list intersections; (2) introduce a kinetic modeling framework for understanding transformer learning; (3) characterize trace faithfulness as a dynamic property that emerges over training; and (4) show CoT alters internal transformer computation mechanistically.
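
A sketch of fitting the three-parameter logistic used to summarize learning curves, with accuracy as a function of log training steps. The parameter names (ceiling L, rate k, midpoint t0) and the synthetic data are generic illustrations, not the paper's notation or results.

```python
# Fit a three-parameter logistic to a (synthetic) learning curve.
import numpy as np
from scipy.optimize import curve_fit

def logistic3(log_t, L, k, t0):
    """Accuracy vs. log step: ceiling L, learning rate k, midpoint t0."""
    return L / (1.0 + np.exp(-k * (log_t - t0)))

# Synthetic learning curve: accuracy measured at log-spaced training steps.
steps = np.logspace(1, 5, 30)
log_t = np.log10(steps)
acc = 0.9 / (1 + np.exp(-2.5 * (log_t - 3.0))) + np.random.normal(0, 0.02, 30)

(L, k, t0), _ = curve_fit(logistic3, log_t, acc, p0=[1.0, 1.0, 3.0])
print(f"ceiling={L:.2f}, rate={k:.2f}, midpoint=10^{t0:.1f} steps")
```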

[317] Optimal Information Combining for Multi-Agent Systems Using Adaptive Bias Learning

Siavash M. Alamouti, Fay Arjomandi

Main category: cs.LG

TL;DR: The paper introduces a theoretical framework to determine when bias learning in multi-agent systems is worthwhile, using a learnability ratio to quantify bias predictability and presenting the ABLOC algorithm that achieves performance improvements bounded by this ratio.

DetailsMotivation: Multi-agent systems suffer performance degradation from systematic biases that vary with environmental conditions, leading to inaccurate monitoring, unreliable predictions, and flawed human judgment aggregation. Current approaches either ignore biases or require expensive calibration.

Method: Developed a theoretical framework decomposing biases into learnable systematic and irreducible stochastic components. Introduced learnability ratio as fraction of bias variance predictable from covariates. Created ABLOC algorithm that iteratively learns bias-correcting transformations and optimizes combination weights with closed-form solutions.

Result: Systems with high learnability ratios achieved 40%-70% of theoretical maximum performance improvement, while those with low learnability showed minimal benefit. The learnability ratio provides quantitative bounds on achievable performance improvement.

Conclusion: The learnability ratio serves as a diagnostic criterion for practical deployment decisions, determining when bias learning is worthwhile versus simpler approaches. ABLOC algorithm guarantees convergence to theoretical performance bounds.

Abstract: Modern multi-agent systems ranging from sensor networks monitoring critical infrastructure to crowdsourcing platforms aggregating human intelligence can suffer significant performance degradation due to systematic biases that vary with environmental conditions. Current approaches either ignore these biases, leading to suboptimal decisions, or require expensive calibration procedures that are often infeasible in practice. This performance gap has real consequences: inaccurate environmental monitoring, unreliable financial predictions, and flawed aggregation of human judgments. This paper addresses the fundamental question: when can we learn and correct for these unknown biases to recover near-optimal performance, and when is such learning futile? We develop a theoretical framework that decomposes biases into learnable systematic components and irreducible stochastic components, introducing the concept of learnability ratio as the fraction of bias variance predictable from observable covariates. This ratio determines whether bias learning is worthwhile for a given system. We prove that the achievable performance improvement is fundamentally bounded by this learnability ratio, providing system designers with quantitative guidance on when to invest in bias learning versus simpler approaches. We present the Adaptive Bias Learning and Optimal Combining (ABLOC) algorithm, which iteratively learns bias-correcting transformations while optimizing combination weights through closed-form solutions, guaranteeing convergence to these theoretical bounds. Experimental validation demonstrates that systems with high learnability ratios can recover significant performance (we achieved 40%-70% of theoretical maximum improvement in our examples), while those with low learnability show minimal benefit, validating our diagnostic criteria for practical deployment decisions.
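
A sketch of estimating the learnability ratio: the fraction of bias variance predictable from observable covariates. Using a linear regression for the predictable part is an assumption for illustration; the paper's estimator may differ.

```python
# Estimate the learnability ratio via regression of bias on covariates.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
covariates = rng.normal(size=(n, 3))          # e.g., temperature, humidity, load

# Agent bias = systematic (learnable) part + irreducible stochastic part.
systematic = covariates @ np.array([0.8, -0.5, 0.3])
bias = systematic + rng.normal(scale=0.5, size=n)

# Fit bias ~ covariates; the explained-variance fraction is the learnability ratio.
coef, *_ = np.linalg.lstsq(covariates, bias, rcond=None)
residual = bias - covariates @ coef
learnability_ratio = 1.0 - residual.var() / bias.var()
print(f"learnability ratio ~= {learnability_ratio:.2f}")  # ~0.8 here by construction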

[318] Network-Constrained Policy Optimization for Adaptive Multi-agent Vehicle Routing

Fazel Arasteh, Arian Haghparast, Manos Papagelis

Main category: cs.LG

TL;DR: Proposes two multi-agent reinforcement learning approaches for dynamic vehicle routing: Adaptive Navigation (AN) for decentralized intersection-level control, and Hierarchical Hub-based Adaptive Navigation (HHAN) for scalable routing in large networks using hub coordination.

DetailsMotivation: Traditional SPF algorithms worsen congestion in multi-vehicle settings by routing all vehicles along identical paths. Need for coordinated, network-aware fleet navigation that adapts to dynamic traffic conditions.

Method: AN uses decentralized MARL with Graph Attention Networks for local traffic modeling. HHAN extends AN with hierarchical hub-based routing where agents control hub-to-hub routing while SPF handles micro-routing within hub regions, using centralized training with decentralized execution under A-QMIX framework.

Result: AN reduces average travel time vs SPF and learning baselines with 100% routing success. HHAN scales to hundreds of intersections, achieving up to 15.9% improvement under heavy traffic on synthetic grids and real urban maps (Toronto, Manhattan).

Conclusion: Network-constrained MARL enables scalable, coordinated, and congestion-aware routing for intelligent transportation systems, with hierarchical approaches providing effective scaling to large urban networks.

Abstract: Traffic congestion in urban road networks leads to longer trip times and higher emissions, especially during peak periods. While the Shortest Path First (SPF) algorithm is optimal for a single vehicle in a static network, it performs poorly in dynamic, multi-vehicle settings, often worsening congestion by routing all vehicles along identical paths. We address dynamic vehicle routing through a multi-agent reinforcement learning (MARL) framework for coordinated, network-aware fleet navigation. We first propose Adaptive Navigation (AN), a decentralized MARL model where each intersection agent provides routing guidance based on (i) local traffic and (ii) neighborhood state modeled using Graph Attention Networks (GAT). To improve scalability in large networks, we further propose Hierarchical Hub-based Adaptive Navigation (HHAN), an extension of AN that assigns agents only to key intersections (hubs). Vehicles are routed hub-to-hub under agent control, while SPF handles micro-routing within each hub region. For hub coordination, HHAN adopts centralized training with decentralized execution (CTDE) under the Attentive Q-Mixing (A-QMIX) framework, which aggregates asynchronous vehicle decisions via attention. Hub agents use flow-aware state features that combine local congestion and predictive dynamics for proactive routing. Experiments on synthetic grids and real urban maps (Toronto, Manhattan) show that AN reduces average travel time versus SPF and learning baselines, maintaining 100% routing success. HHAN scales to networks with hundreds of intersections, achieving up to 15.9% improvement under heavy traffic. These findings highlight the potential of network-constrained MARL for scalable, coordinated, and congestion-aware routing in intelligent transportation systems.

[319] Non-myopic Matching and Rebalancing in Large-Scale On-Demand Ride-Pooling Systems Using Simulation-Informed Reinforcement Learning

Farnoosh Namdarpour, Joseph Y. J. Chow

Main category: cs.LG

TL;DR: This paper proposes a simulation-informed reinforcement learning approach for ride-pooling systems to address myopic decision-making, improving service rates, reducing wait times, and enabling significant fleet size reductions.

DetailsMotivation: Ride-pooling services currently suffer from myopic decision-making that overlooks long-term effects of dispatch decisions, limiting their potential benefits in cost reduction and congestion mitigation.

Method: Extends learning and planning framework from ride-hailing to ride-pooling by embedding ride-pooling simulation within RL mechanism, using n-step temporal difference learning on simulated experiences to derive spatiotemporal state values.

Result: Non-myopic policy increases service rate by up to 8.4%, reduces wait and in-vehicle times, and can decrease fleet size by over 25% while maintaining performance. Adding rebalancing further cuts wait time by 27.3% and in-vehicle time by 12.5%.

Conclusion: The proposed simulation-informed RL approach enables effective non-myopic decision-making in ride-pooling systems, offering significant performance improvements and cost savings for operators while enhancing passenger experience.

Abstract: Ride-pooling, also known as ride-sharing, shared ride-hailing, or microtransit, is a service wherein passengers share rides. This service can reduce costs for both passengers and operators and reduce congestion and environmental impacts. A key limitation, however, is its myopic decision-making, which overlooks long-term effects of dispatch decisions. To address this, we propose a simulation-informed reinforcement learning (RL) approach. While RL has been widely studied in the context of ride-hailing systems, its application in ride-pooling systems has been less explored. In this study, we extend the learning and planning framework of Xu et al. (2018) from ride-hailing to ride-pooling by embedding a ride-pooling simulation within the learning mechanism to enable non-myopic decision-making. In addition, we propose a complementary policy for rebalancing idle vehicles. By employing n-step temporal difference learning on simulated experiences, we derive spatiotemporal state values and subsequently evaluate the effectiveness of the non-myopic policy using NYC taxi request data. Results demonstrate that the non-myopic policy for matching can increase the service rate by up to 8.4% versus a myopic policy while reducing both in-vehicle and wait times for passengers. Furthermore, the proposed non-myopic policy can decrease fleet size by over 25% compared to a myopic policy, while maintaining the same level of performance, thereby offering significant cost savings for operators. Incorporating rebalancing operations into the proposed framework cuts wait time by up to 27.3%, in-vehicle time by 12.5%, and raises service rate by 15.1% compared to using the framework for matching decisions alone at the cost of increased vehicle minutes traveled per passenger.
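
A sketch of the n-step temporal-difference update used to learn spatiotemporal state values from simulated experiences. The tabular (zone, time-bucket) state indexing and the constants are illustrative simplifications of the paper's setup.

```python
# n-step TD learning of spatiotemporal state values (tabular schematic).
from collections import defaultdict

GAMMA, ALPHA, N = 0.95, 0.1, 5
V = defaultdict(float)  # value of (zone, time_bucket) states

def n_step_td_update(trajectory):
    """trajectory: list of ((zone, t), reward) from one simulated episode."""
    for i in range(len(trajectory)):
        state, _ = trajectory[i]
        # Accumulate discounted rewards over the next n steps.
        G, steps = 0.0, trajectory[i : i + N]
        for k, (_, r) in enumerate(steps):
            G += (GAMMA ** k) * r
        # Bootstrap from the state n steps ahead, if the episode continues.
        if i + N < len(trajectory):
            G += (GAMMA ** N) * V[trajectory[i + N][0]]
        V[state] += ALPHA * (G - V[state])

# One simulated episode: serve two requests in zone 3, then idle in zone 1.
episode = [((3, 0), 1.0), ((3, 1), 1.0), ((1, 2), 0.0), ((1, 3), 0.0), ((2, 4), 1.0)]
n_step_td_update(episode)
print(dict(V))
```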

[320] MemEIC: A Step Toward Continual and Compositional Knowledge Editing

Jin Seong, Jiyun Park, Wencke Liermann, Hongseok Choi, Yoonji Nam, Hyun Kim, Soojong Lim, Namhoon Lee

Main category: cs.LG

TL;DR: MemEIC enables continual and compositional knowledge editing in large vision-language models by using a hybrid external-internal editor with dual memory and LoRA adapters for cross-modal updates.

DetailsMotivation: Current knowledge editing methods focus on single modalities in isolation, neglecting the multimodal nature of LVLMs and continuous knowledge updates, leading to suboptimal outcomes.

Method: Uses a hybrid external-internal editor with dual external memory for cross-modal evidence retrieval, dual LoRA adapters for disentangled parameter updates per modality, and a brain-inspired knowledge connector for compositional reasoning.

Result: Significantly improves performance on complex multimodal questions and effectively preserves prior edits.

Conclusion: Sets a new benchmark for continual and compositional knowledge editing in large vision-language models.

Abstract: The dynamic nature of information necessitates continuously updating large vision-language models (LVLMs). While recent knowledge editing techniques hint at promising directions, they often focus on editing a single modality (vision or language) in isolation. This prevalent practice neglects the inherent multimodality of LVLMs and the continuous nature of knowledge updates, potentially leading to suboptimal editing outcomes when considering the interplay between modalities and the need for ongoing knowledge refinement. To address these limitations, we propose MemEIC, a novel method for Continual and Compositional Knowledge Editing (CCKE) in LVLMs. MemEIC enables compositional editing of both visual and textual knowledge sequentially. Our approach employs a hybrid external-internal editor featuring a dual external memory for cross-modal evidence retrieval and dual LoRA adapters that facilitate disentangled parameter updates for each modality. A key component is a brain-inspired knowledge connector, activated selectively for compositional reasoning, that integrates information across different modalities. Experiments demonstrate that MemEIC significantly improves performance on complex multimodal questions and effectively preserves prior edits, setting a new benchmark for CCKE in LVLMs.

[321] FreLE: Low-Frequency Spectral Bias in Neural Networks for Time-Series Tasks

Jialong Sun, Xinpeng Ling, Jiaxuan Zou, Jiawen Kang, Kejia Zhang

Main category: cs.LG

TL;DR: The paper identifies spectral bias as a universal phenomenon in time series prediction models and proposes FreLE, a plug-and-play loss function that uses frequency regularization to improve model generalization.

DetailsMotivation: To address the challenge of time series autocorrelation and unify understanding of spectral bias phenomenon across different models, recognizing it as a universal characteristic rather than architecture-specific.

Method: Conducted extensive empirical experiments to measure spectral bias in mainstream models, then proposed FreLE algorithm that enhances model generalization through explicit and implicit frequency regularization.

Result: Found that virtually all models exhibit spectral bias phenomenon. FreLE demonstrated superior performance in extensive experiments as a plug-and-play loss function unit.

Conclusion: Spectral bias is a universal phenomenon in time series prediction models, and the proposed FreLE algorithm effectively mitigates its impact through frequency regularization, improving model generalization.

Abstract: The inherent autocorrelation of time series data presents an ongoing challenge to multivariate time series prediction. Recently, a widely adopted approach has been the incorporation of frequency domain information to assist in long-term prediction tasks. Many researchers have independently observed the spectral bias phenomenon in neural networks, where models tend to fit low-frequency signals before high-frequency ones. However, these observations have often been attributed to the specific architectures designed by the researchers, rather than recognizing the phenomenon as a universal characteristic across models. To unify the understanding of the spectral bias phenomenon in long-term time series prediction, we conducted extensive empirical experiments to measure spectral bias in existing mainstream models. Our findings reveal that virtually all models exhibit this phenomenon. To mitigate the impact of spectral bias, we propose the FreLE (Frequency Loss Enhancement) algorithm, which enhances model generalization through both explicit and implicit frequency regularization. FreLE is a plug-and-play loss function unit, and extensive experiments demonstrate its superior performance. Code is available at https://github.com/Chenxing-Xuan/FreLE.
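
A sketch of a frequency-regularized loss in the spirit of FreLE: a standard time-domain term plus a penalty on spectral error computed from the FFT of the predictions. The exact form and weighting in FreLE may differ; this is a generic illustration of explicit frequency regularization.

```python
# Time-domain MSE plus an FFT-domain penalty (generic frequency regularization).
import torch

def frequency_enhanced_loss(pred: torch.Tensor, target: torch.Tensor,
                            freq_weight: float = 0.5) -> torch.Tensor:
    """pred, target: (batch, horizon) forecasts and ground truth."""
    time_loss = torch.mean((pred - target) ** 2)
    # Compare complex spectra so both amplitude and phase errors are penalized.
    pred_fft, target_fft = torch.fft.rfft(pred, dim=-1), torch.fft.rfft(target, dim=-1)
    freq_loss = torch.mean(torch.abs(pred_fft - target_fft) ** 2)
    return time_loss + freq_weight * freq_loss

pred, target = torch.randn(8, 96), torch.randn(8, 96)
print(frequency_enhanced_loss(pred, target).item())
```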

[322] Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning

Wenchang Duan, Yaoliang Yu, Jiwan He, Yi Shi

Main category: cs.LG

TL;DR: Proposes a novel MARL framework with adaptive context length optimization using temporal gradient analysis and Fourier-based low-frequency truncation to filter redundant information, achieving SOTA performance on long-term dependency tasks.

DetailsMotivation: Large fixed context lengths in deep MARL lead to limited exploration efficiency and redundant information, hindering optimal performance in complex environments.

Method: Uses a central agent to dynamically optimize context length via temporal gradient analysis, combined with Fourier-based low-frequency truncation to extract global temporal trends and filter redundant information.

Result: Achieves state-of-the-art performance on PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2) benchmarks.

Conclusion: The proposed adaptive context length optimization framework effectively addresses exploration limitations in MARL and demonstrates superior performance on challenging long-term dependency tasks.

Abstract: Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance for solving challenging tasks, such as long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
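
A sketch of the Fourier-based low-frequency truncation used to compress the agents' temporal context: keep only the lowest-frequency components of each agent's observation history. The cutoff and tensor layout are illustrative assumptions.

```python
# Keep only low-frequency temporal structure of agent histories (schematic).
import torch

def low_freq_truncate(history: torch.Tensor, keep: int = 8) -> torch.Tensor:
    """history: (agents, timesteps, features); keep: low-frequency bins retained."""
    spec = torch.fft.rfft(history, dim=1)   # frequency content along time
    spec[:, keep:, :] = 0                   # drop high-frequency detail
    return torch.fft.irfft(spec, n=history.shape[1], dim=1)

history = torch.randn(4, 128, 16)           # 4 agents, 128 steps, 16 features
smoothed = low_freq_truncate(history)
print(smoothed.shape)                        # torch.Size([4, 128, 16])
```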

[323] Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start

Kun Chen, Peng Shi, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao, Lin Ma

Main category: cs.LG

TL;DR: SPECS is a self-distilled, preference-based cold start framework that replaces SFT-based initialization with preference learning, decoupling multimodal learning into surface-form criteria learning and deep reasoning, achieving better generalization and performance.

DetailsMotivation: SFT-based cold start methods induce instruction-style overfitting and weaken out-of-distribution generalization, which negatively impacts downstream RL performance. The authors aim to find better cold start approaches that improve generalization.

Method: SPECS framework: (1) generates introspective preference data pairs via self-distillation without external teachers, (2) performs preference-based training (e.g., DPO) focusing on shallow surface-form criteria rather than content memorization, (3) hands off to RL with verifiable rewards for deep reasoning.

Result: Experimental results show consistent performance gains: 4.1% improvement on MEGA-Bench and 12.2% on MathVista. SPECS reduces in-distribution “stuckness,” improves exploration, stabilizes training, and raises performance ceiling.

Conclusion: Preference-based training methods generalize better than SFT-based methods in cold start. The decoupled learning framework of SPECS provides superior initialization for RL, leading to better performance and generalization across multimodal benchmarks.

Abstract: Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of “MLLM-r1” approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts the reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weakens out-of-distribution generalization, and ultimately affects downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g. DPO) generalize better than SFT-based methods in cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: (1) generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) performs preference-based training to learn, focusing on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off to RL with verifiable rewards for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupling learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution “stuckness,” improving exploration, stabilizing training, and raising the performance ceiling.
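
A sketch of the preference-based cold-start step, using the standard DPO objective on self-distilled preference pairs. The log-probabilities below are placeholders; in SPECS they would come from the policy and a frozen reference model scoring the chosen and rejected responses.

```python
# Standard DPO loss on a batch of preference pairs (schematic inputs).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Prefer the chosen response, measured relative to the reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Placeholder log-probs for a batch of 4 self-distilled preference pairs.
pc, pr = torch.tensor([-5., -6., -4., -7.]), torch.tensor([-9., -8., -7., -8.])
rc, rr = torch.tensor([-6., -6., -5., -7.]), torch.tensor([-8., -8., -6., -8.])
print(dpo_loss(pc, pr, rc, rr).item())
```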

[324] Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training

Hong Wang, Haiyang Xin, Jie Wang, Xuanze Yang, Fei Zha, Huanshuo Dong, Yan Jiang

Main category: cs.LG

TL;DR: Proposes MoE-POT, a sparse-activated Mixture-of-Experts architecture for neural operators that efficiently scales parameters while controlling inference costs, achieving 40% error reduction with fewer activated parameters.

DetailsMotivation: Address challenges in PDE neural operator pre-training: heterogeneity of PDE datasets causing high mixed training errors, and dense pre-training models having high inference costs.

Method: Uses layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks during inference, plus 2 shared experts to capture common PDE properties. Final output is weighted average of activated experts.

Result: Pre-trained models from 30M to 0.5B parameters on 6 PDE datasets. Model with 90M activated parameters achieves 40% zero-shot error reduction compared to existing 120M parameter models. Router decisions can infer dataset types.

Conclusion: MoE-POT effectively addresses PDE dataset heterogeneity and inference cost issues through sparse activation, with interpretability validating the architecture’s rationality.

Abstract: Pre-training has proven effective in addressing data scarcity and performance limitations in solving PDE problems with neural operators. However, challenges remain due to the heterogeneity of PDE datasets in equation types, which leads to high errors in mixed training. Additionally, dense pre-training models that scale parameters by increasing network width or depth incur significant inference costs. To tackle these challenges, we propose a novel Mixture-of-Experts Pre-training Operator Transformer (MoE-POT), a sparse-activated architecture that scales parameters efficiently while controlling inference costs. Specifically, our model adopts a layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks during inference, enabling the model to focus on equation-specific features. Meanwhile, we also integrate 2 shared experts, aiming to capture common properties of PDE and reduce redundancy among routed experts. The final output is computed as the weighted average of the results from all activated experts. We pre-train models with parameters from 30M to 0.5B on 6 public PDE datasets. Our model with 90M activated parameters achieves up to a 40% reduction in zero-shot error compared with existing models with 120M activated parameters. Additionally, we conduct interpretability analysis, showing that dataset types can be inferred from router-gating network decisions, which validates the rationality and effectiveness of the MoE architecture.
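
To make the routing arithmetic concrete, here is an illustrative top-4-of-16 MoE layer with 2 always-on shared experts, written as a naive per-sample dispatch. Dimensions, gating details, and any load-balancing terms are assumptions not stated in the abstract.

```python
# Illustrative sparse MoE layer: router picks 4 of 16 routed experts per input,
# 2 shared experts always fire, and outputs are combined with gate weights.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim=64, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        self.router = nn.Linear(dim, n_routed)
        self.routed = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_routed)])
        self.shared = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_shared)])
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, dim)
        gate, idx = self.router(x).topk(self.top_k, dim=-1)
        gate = torch.softmax(gate, dim=-1)             # weights of the 4 winners
        shared_out = sum(e(x) for e in self.shared)    # shared experts always on
        rows = []
        for b in range(x.shape[0]):                    # naive per-sample dispatch
            r = shared_out[b]
            for j in range(self.top_k):
                r = r + gate[b, j] * self.routed[int(idx[b, j])](x[b])
            rows.append(r)
        return torch.stack(rows)                       # weighted expert combination

out = MoELayer()(torch.randn(8, 64))
```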

[325] PRESTO: Preimage-Informed Instruction Optimization for Prompting Black-Box LLMs

Jaewon Chu, Seunghun Lee, Hyunwoo J. Kim

Main category: cs.LG

TL;DR: PRESTO is a novel framework that leverages the preimage structure of soft prompts to optimize instructions for black-box LLMs more efficiently by sharing evaluation scores, using preimage-based initialization, and enforcing score consistency regularization.

DetailsMotivation: Black-box LLMs are widely used but their internal parameters are inaccessible, making instruction optimization challenging. Existing methods using white-box LLMs suffer from redundant queries due to many-to-one mapping of soft prompts to instructions.

Method: PRESTO uses three components: score sharing (sharing evaluation scores across all soft prompts in a preimage), preimage-based initialization (selecting initial points to maximize search space coverage), and score consistency regularization (enforcing prediction consistency within preimages).

Result: PRESTO effectively obtains 14 times more scored data under the same query budget, demonstrating superior performance on 33 instruction optimization tasks compared to previous methods.

Conclusion: The preimage structure, previously seen as a hindrance, can be leveraged as useful prior knowledge to accelerate instruction optimization for black-box LLMs, making PRESTO an efficient and effective framework.

Abstract: Large language models (LLMs) have achieved remarkable success across diverse domains, due to their strong instruction-following capabilities. This has led to increasing interest in optimizing instructions for black-box LLMs, whose internal parameters are inaccessible but widely used due to their strong performance. To optimize instructions for black-box LLMs, recent methods employ white-box LLMs to generate candidate instructions from optimized soft prompts. However, white-box LLMs often map different soft prompts to the same instruction, leading to redundant queries. While previous studies regarded this many-to-one mapping as a structure that hinders optimization efficiency, we reinterpret it as a useful prior knowledge that can accelerate the optimization. To this end, we introduce PREimage-informed inSTruction Optimization (PRESTO), a novel framework that leverages the preimage structure of soft prompts for efficient optimization. PRESTO consists of three key components: (1) score sharing, which shares the evaluation score with all soft prompts in a preimage; (2) preimage-based initialization, which selects initial data points that maximize search space coverage using preimage information; and (3) score consistency regularization, which enforces prediction consistency within each preimage. By leveraging preimages, PRESTO achieves the effect of effectively obtaining 14 times more scored data under the same query budget, resulting in more efficient optimization. Experimental results on 33 instruction optimization tasks demonstrate the superior performance of PRESTO. Code is available at https://github.com/mlvlab/PRESTO
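
A toy sketch of the score-sharing component may clarify the mechanism: soft prompts are grouped by the instruction they decode to, and a single evaluation is propagated to every member of the preimage. `decode` and `evaluate` stand in for the white-box LLM and the downstream scorer.

```python
# Score sharing across a preimage: many soft prompts that decode to the same
# instruction all inherit one evaluation score (illustrative placeholders).
from collections import defaultdict

def share_scores(soft_prompts, decode, evaluate):
    """Group soft prompts by decoded instruction; score each preimage once."""
    preimages = defaultdict(list)
    for z in soft_prompts:
        preimages[decode(z)].append(z)     # many z's can hit one instruction
    scored = {}
    for instruction, members in preimages.items():
        s = evaluate(instruction)          # one query covers the whole preimage
        for z in members:
            scored[z] = s
    return scored

# toy usage: three "soft prompts" collapse into two instructions
print(share_scores([0.10, 0.12, 0.90], decode=round, evaluate=lambda i: 2 * i))
```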

[326] ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, Dong-Jin Kim

Main category: cs.LG

TL;DR: ScaleDiff is a training-free framework that extends resolution capabilities of pretrained diffusion models using Neighborhood Patch Attention for efficiency and Latent Frequency Mixing for detail generation.

DetailsMotivation: Text-to-image diffusion models perform poorly beyond their training resolution, and existing training-free methods are computationally expensive or incompatible with Diffusion Transformers.

Method: Proposes Neighborhood Patch Attention (NPA) to reduce computational redundancy in self-attention, integrates it with SDEdit pipeline, adds Latent Frequency Mixing for fine details, and uses Structure Guidance for global structure enhancement.

Result: Achieves state-of-the-art performance among training-free methods in both image quality and inference speed on both U-Net and Diffusion Transformer architectures.

Conclusion: ScaleDiff provides an efficient, model-agnostic solution for resolution extension without additional training, compatible with various diffusion model architectures.

Abstract: Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
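
The efficiency claim of NPA rests on restricting self-attention to non-overlapping patches; below is a minimal sketch of that restriction. Patch size and shapes are illustrative, and the real method presumably adds more machinery around it.

```python
# Attention restricted to non-overlapping patches: each token attends only
# within its local window, avoiding the full quadratic cost (sketch only).
import torch
import torch.nn.functional as F

def patch_attention(q, k, v, patch=64):
    """q, k, v: (batch, seq, dim) with seq divisible by patch."""
    b, n, d = q.shape
    q, k, v = (t.reshape(b * n // patch, patch, d) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v)   # attention per patch only
    return out.reshape(b, n, d)

x = torch.randn(2, 256, 32)
y = patch_attention(x, x, x, patch=64)
```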

[327] MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs

Xiaoke Huang, Ningsen Wang, Hui Liu, Xianfeng Tang, Yuyin Zhou

Main category: cs.LG

TL;DR: MedVLSynther is a framework that automatically generates high-quality medical VQA questions from biomedical literature using a generator-verifier approach, creating the MedSynVQA dataset that improves LMM performance on medical benchmarks.

DetailsMotivation: Training general medical VQA systems is hindered by the lack of large, high-quality datasets, as existing medical VQA corpora are limited in size and not openly available.

Method: A rubric-guided generator-verifier framework that synthesizes multiple-choice VQA items from biomedical literature by conditioning on figures, captions, and references. The generator produces stems and options, while a multi-stage verifier enforces quality gates and awards points before acceptance.

Result: Created MedSynVQA with 13,087 questions over 14,803 images spanning 13 modalities and 28 anatomical regions. Training LMMs with this data improved accuracy across six benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA.

Conclusion: MedVLSynther provides an auditable, reproducible, and privacy-preserving approach to scalable medical VQA training data generation, with both generation and verification being essential for quality.

Abstract: Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training open-weight LMMs with reinforcement learning using verifiable rewards improves accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA, outperforming strong medical LMMs. Ablations verify that both generation and verification are necessary and that more verified data consistently helps, and a targeted contamination analysis detects no leakage from evaluation suites. By operating entirely on open literature and open-weight models, MedVLSynther offers an auditable, reproducible, and privacy-preserving path to scalable medical VQA training data.
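
A toy version of the verifier's cheap structural gates, assuming a machine-checkable JSON item with a stem and mutually exclusive options. The clinically substantive gates (clinical validity, image-text consistency) require an LMM judge and are not shown here.

```python
# Structural gates on a generated multiple-choice item, sketched under the
# assumption of fields named "stem", "options", and "answer" (hypothetical).
import json

def passes_structural_gates(raw: str) -> bool:
    try:
        item = json.loads(raw)
    except json.JSONDecodeError:
        return False
    options = item.get("options", [])
    return (
        bool(item.get("stem"))                    # self-contained stem present
        and len(options) >= 4
        and len(set(options)) == len(options)     # options mutually distinct
        and item.get("answer") in options         # exactly one keyed answer
    )

print(passes_structural_gates('{"stem": "Which finding?", '
                              '"options": ["A", "B", "C", "D"], "answer": "B"}'))
```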

[328] $π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, Chao Yu

Main category: cs.LG

TL;DR: π_RL is an open-source framework that enables reinforcement learning for flow-based Vision-Language-Action models by addressing intractable action log-likelihoods through two novel RL algorithms: Flow-Noise and Flow-SDE.

DetailsMotivation: Applying large-scale RL to flow-based VLAs is challenging due to intractable action log-likelihoods from iterative denoising processes, which hinders automated data collection for scaling supervised fine-tuning.

Method: π_RL implements two RL algorithms: (1) Flow-Noise models denoising as discrete-time MDP with learnable noise network for exact log-likelihood computation, (2) Flow-SDE integrates denoising with agent-environment interaction using ODE-to-SDE conversion for efficient RL exploration.

Result: On LIBERO benchmark, π_RL boosted few-shot SFT models from 57.6% to 97.6% and from 77.1% to 98.3%. In ManiSkill with 320 parallel environments, it improved performance from 41.6% to 85.7% and from 40.0% to 84.8% across 4352 pick-and-place tasks.

Conclusion: π_RL achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs and demonstrating scalable multitask RL under heterogeneous simulation.

Abstract: Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with $\pi_{\text{RL}}$, an open-source framework for training flow-based VLAs in parallel simulation. $\pi_{\text{RL}}$ implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $\pi_{\text{RL}}$ on LIBERO and ManiSkill benchmarks. On LIBERO, $\pi_{\text{RL}}$ boosts few-shot SFT models $\pi_0$ and $\pi_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train $\pi_{\text{RL}}$ in 320 parallel environments, improving $\pi_0$ from 41.6% to 85.7% and $\pi_{0.5}$ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, $\pi_{\text{RL}}$ achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
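
A hedged sketch of what makes Flow-Noise's log-likelihoods exact: if each denoising step is modeled as a Gaussian transition with a learnable noise scale, the per-step log-probability needed by policy-gradient RL is available in closed form. Shapes and the fixed sigma below are placeholders, not the paper's implementation.

```python
# If a denoising step is a_{t+1} ~ N(mu_theta(a_t), sigma_theta(a_t)^2), its
# log-likelihood is exact, which is what policy-gradient updates need (sketch).
import torch
from torch.distributions import Normal

def step_log_prob(a_next, mu, sigma):
    """Exact log-likelihood of one denoising step under Gaussian noise."""
    return Normal(mu, sigma).log_prob(a_next).sum(dim=-1)

mu = torch.zeros(4, 7)                    # predicted mean, 7-dim action chunk
sigma = torch.full((4, 7), 0.1)           # learnable noise scale (fixed here)
a_next = mu + sigma * torch.randn_like(mu)
logp = step_log_prob(a_next, mu, sigma)   # one term per trajectory in batch
```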

[329] Topology-Aware Active Learning on Graphs

Harris Hardiman-Mostow, Jack Mauro, Adrien Weihs, Andrea L. Bertozzi

Main category: cs.LG

TL;DR: A graph-topological active learning method using Balanced Forman Curvature for coreset construction and dynamic exploration-exploitation switching, with localized graph rewiring for improved label propagation.

DetailsMotivation: To address the core challenge of exploration versus exploitation in active learning under scarce label budgets, moving beyond hand-tuned heuristics.

Method: Uses Balanced Forman Curvature for coreset construction with data-driven stopping criterion, dynamic exploration-exploitation switching, and localized graph rewiring for multiscale information incorporation.

Result: Consistently outperforms existing graph-based semi-supervised baselines on benchmark classification tasks at low label rates.

Conclusion: The proposed graph-topological approach effectively handles exploration-exploitation trade-offs and improves performance in label-scarce scenarios.

Abstract: We propose a graph-topological approach to active learning that directly targets the core challenge of exploration versus exploitation under scarce label budgets. To guide exploration, we introduce a coreset construction algorithm based on Balanced Forman Curvature (BFC), which selects representative initial labels that reflect the graph’s cluster structure. This method includes a data-driven stopping criterion that signals when the graph has been sufficiently explored. We further use BFC to dynamically trigger the shift from exploration to exploitation within active learning routines, replacing hand-tuned heuristics. To improve exploitation, we introduce a localized graph rewiring strategy that efficiently incorporates multiscale information around labeled nodes, enhancing label propagation while preserving sparsity. Experiments on benchmark classification tasks show that our methods consistently outperform existing graph-based semi-supervised baselines at low label rates.
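
For orientation, the quantity driving the coreset construction can be approximated cheaply. Below is a simplified edge curvature keeping only the degree and triangle terms of Balanced Forman Curvature and omitting its 4-cycle correction, so it is illustrative rather than the full definition.

```python
# Degree-and-triangle part of Balanced Forman Curvature on an edge (the full
# definition adds a 4-cycle term); networkx-based and purely illustrative.
import networkx as nx

def bfc_triangle_part(G: nx.Graph, i, j) -> float:
    di, dj = G.degree(i), G.degree(j)
    tri = len(set(G[i]) & set(G[j]))       # triangles through edge (i, j)
    return 2 / di + 2 / dj - 2 + 2 * tri / max(di, dj) + tri / min(di, dj)

G = nx.karate_club_graph()
print(bfc_triangle_part(G, 0, 1))
```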

[330] Transferring Causal Effects using Proxies

Manuel Iglesias-Alonso, Felix Schur, Julius von Kügelgen, Jonas Peters

Main category: cs.LG

TL;DR: Methodology for estimating causal effects in multi-domain settings with unobserved confounders using proxy variables, with identifiability proofs and estimation techniques.

DetailsMotivation: Address the challenge of estimating causal effects when confounders are unobserved and effects vary across domains, using proxy variables as a solution.

Method: Proposed methodology using proxy variables for hidden confounders, with two estimation techniques for discrete/categorical variables, proving identifiability even for continuous treatments and responses.

Result: Proved identifiability under the conditions, introduced consistent estimation techniques with confidence intervals, validated through simulations and real-world application on website rankings and consumer choices.

Conclusion: The approach successfully enables causal effect estimation in target domains with unobserved confounders using proxy variables, with theoretical guarantees and empirical validation.

Abstract: We consider the problem of estimating a causal effect in a multi-domain setting. The causal effect of interest is confounded by an unobserved confounder and can change between the different domains. We assume that we have access to a proxy of the hidden confounder and that all variables are discrete or categorical. We propose methodology to estimate the causal effect in the target domain, where we assume to observe only the proxy variable. Under these conditions, we prove identifiability (even when treatment and response variables are continuous). We introduce two estimation techniques, prove consistency, and derive confidence intervals. The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.
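
For intuition, the classical discrete-proxy identification argument that this line of work builds on can be written in one line; the paper's multi-domain conditions are more involved, so treat this as background rather than the paper's result. Assuming a discrete hidden confounder $U$ and a proxy $W$ with $W \perp X \mid U$:

$$P(w \mid x) \;=\; \sum_{u} P(w \mid u)\, P(u \mid x), \qquad \text{i.e.} \qquad q_x = M\, p_x, \quad M_{wu} = P(w \mid u).$$

If $M$ is invertible, $p_x = M^{-1} q_x$ recovers the confounder distribution given $x$, which is the key ingredient for evaluating adjustment formulas such as $P(y \mid \mathrm{do}(x)) = \sum_u P(y \mid x, u)\, P(u)$ in the target domain.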

[331] Active Learning with Task-Driven Representations for Messy Pools

Kianoosh Ashouritaklimi, Tom Rainforth

Main category: cs.LG

TL;DR: Active learning with task-driven representations that update during the process outperforms fixed unsupervised representations for messy data pools.

DetailsMotivation: Current active learning approaches use fixed unsupervised representations which fail to capture task-relevant information in messy, uncurated data pools.

Method: Proposed two strategies: learning semi-supervised representations directly, and supervised fine-tuning of initial unsupervised representations, both updated periodically during active learning.

Result: Both proposed strategies significantly improved empirical performance compared to using unsupervised or pretrained representations.

Conclusion: Periodically updating task-driven representations during active learning is more effective than fixed representations for handling messy data pools.

Abstract: Active learning has the potential to be especially useful for messy, uncurated pools where datapoints vary in relevance to the target task. However, state-of-the-art approaches to this problem currently rely on using fixed, unsupervised representations of the pool, focusing on modifying the acquisition function instead. We show that this model setup can undermine their effectiveness at dealing with messy pools, as such representations can fail to capture important information relevant to the task. To address this, we propose using task-driven representations that are periodically updated during the active learning process using the previously collected labels. We introduce two specific strategies for learning these representations, one based on directly learning semi-supervised representations and the other based on supervised fine-tuning of an initial unsupervised representation. We find that both significantly improve empirical performance over using unsupervised or pretrained representations.
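
The overall loop is simple to state; here is a schematic version, with the acquisition function, labeling oracle, and representation learner passed in as callables (all placeholders for the paper's components).

```python
# Schematic active learning with task-driven representations: unlike the usual
# fixed-representation setup, the representation is refit with collected labels.
def active_learn(pool, oracle, acquire, fit_representation, budget, refresh_every=10):
    labeled = {}
    rep = fit_representation(pool, labeled)          # initially unsupervised
    for t in range(budget):
        x = acquire(pool, labeled, rep)              # e.g. uncertainty sampling
        labeled[x] = oracle(x)                       # query one label
        if (t + 1) % refresh_every == 0:
            rep = fit_representation(pool, labeled)  # task-driven refresh
    return labeled, rep
```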

[332] Robust GNN Watermarking via Implicit Perception of Topological Invariants

Jipeng Li, Yanning Shen

Main category: cs.LG

TL;DR: InvGNN-WM is a trigger-free watermarking method for Graph Neural Networks that ties ownership to the model’s perception of graph invariants, enabling robust black-box verification without affecting task performance.

DetailsMotivation: Existing GNN watermarks rely on backdoor triggers that are vulnerable to common model edits and create ownership ambiguity, necessitating a more robust and trigger-free approach.

Method: Uses a lightweight head to predict normalized algebraic connectivity on owner-private carrier sets, with sign-sensitive bit decoding and calibrated thresholding for false-positive rate control.

Result: Achieves comparable clean accuracy while providing higher watermark accuracy than trigger- and compression-based baselines across diverse datasets and backbones. Remains robust under pruning, fine-tuning, and quantization.

Conclusion: The method provides theoretical guarantees for imperceptibility and robustness, proves exact removal is NP-complete, and demonstrates practical effectiveness in protecting GNN intellectual property.

Abstract: Graph Neural Networks (GNNs) are valuable intellectual property, yet many watermarks rely on backdoor triggers that break under common model edits and create ownership ambiguity. We present InvGNN-WM, which ties ownership to a model’s implicit perception of a graph invariant, enabling trigger-free, black-box verification with negligible task impact. A lightweight head predicts normalized algebraic connectivity on an owner-private carrier set; a sign-sensitive decoder outputs bits, and a calibrated threshold controls the false-positive rate. Across diverse node and graph classification datasets and backbones, InvGNN-WM matches clean accuracy while yielding higher watermark accuracy than trigger- and compression-based baselines. It remains strong under unstructured pruning, fine-tuning, and post-training quantization; plain knowledge distillation (KD) weakens the mark, while KD with a watermark loss (KD+WM) restores it. We provide guarantees for imperceptibility and robustness, and we prove that exact removal is NP-complete.
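
The watermark target itself is easy to compute: normalized algebraic connectivity is the second-smallest eigenvalue of the normalized graph Laplacian, as in the snippet below (illustrative, not the authors' code).

```python
# Normalized algebraic connectivity: lambda_2 of the normalized Laplacian,
# the graph invariant the watermark head regresses on carrier graphs.
import networkx as nx
import numpy as np

def normalized_algebraic_connectivity(G: nx.Graph) -> float:
    L = nx.normalized_laplacian_matrix(G).toarray()
    eigvals = np.sort(np.linalg.eigvalsh(L))
    return float(eigvals[1])              # second-smallest eigenvalue

print(normalized_algebraic_connectivity(nx.karate_club_graph()))
```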

[333] Modular Linear Tokenization (MLT)

Tcharlies Schmitz

Main category: cs.LG

TL;DR: MLT is a reversible, deterministic encoding method for high-cardinality categorical data using modular arithmetic and linear transformations, offering compact vector representations with full reversibility.

DetailsMotivation: Traditional categorical encoding methods like hashing or one-hot encodings lack reversibility and control over dimensionality, especially for large-scale categorical data.

Method: Leverages modular arithmetic over finite fields and invertible linear transformations to create bijective mappings between categorical identifiers and numerical vectors.

Result: On MovieLens 20M dataset, MLT achieves comparable performance to supervised embeddings with significantly fewer parameters and lower training costs.

Conclusion: MLT provides an efficient, scalable alternative to traditional categorical encoding methods with explicit dimensionality control and full reversibility.

Abstract: This paper introduces Modular Linear Tokenization (MLT), a reversible and deterministic technique for encoding high-cardinality categorical identifiers into compact numerical vectors. Unlike traditional hashing or one-hot encodings, MLT preserves bijective mappings by leveraging modular arithmetic over finite fields and invertible linear transformations. The method offers explicit control of dimensionality and computational scalability while maintaining full reversibility, even for millions of identifiers. Experimental results on the MovieLens 20M dataset show that MLT achieves comparable predictive performance to supervised embeddings while requiring significantly fewer parameters and lower training cost. An open-source implementation of MLT is available on PyPI (https://pypi.org/project/light-mlt/) and GitHub (https://github.com/tcharliesschmitz/light-mlt).
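
A hedged reconstruction of the MLT recipe from the abstract: write the identifier in base p (a digit vector over the finite field F_p), apply an invertible linear map modulo p, and invert the map to decode. The prime, the dimension, and the matrix below are arbitrary illustrative choices, not the library's defaults.

```python
# Sketch of a reversible modular tokenization: base-P digits + invertible
# linear map mod P give a bijection between IDs and compact vectors.
import numpy as np

P, DIM = 251, 4                       # prime modulus, vector length (covers P**DIM ids)
A = np.array([[1, 2, 0, 0],
              [0, 1, 3, 0],
              [0, 0, 1, 5],
              [0, 0, 0, 1]]) % P      # unit upper-triangular -> invertible mod P

def encode(uid: int) -> np.ndarray:
    digits = np.array([(uid // P**i) % P for i in range(DIM)])  # base-P digits
    return (A @ digits) % P

def decode(vec: np.ndarray) -> int:
    A_inv = np.round(np.linalg.inv(A)).astype(int) % P  # exact here: det(A) = 1
    digits = (A_inv @ vec) % P
    return int(sum(int(d) * P**i for i, d in enumerate(digits)))

assert decode(encode(123456)) == 123456   # full reversibility
```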

[334] Application and Validation of Geospatial Foundation Model Data for the Prediction of Health Facility Programmatic Outputs – A Case Study in Malawi

Lynn Metz, Rachel Haggard, Michael Moszczynski, Samer Asbah, Chris Mwase, Patricia Khomani, Tyler Smith, Hannah Cooper, Annie Mwale, Arbaaz Muslim, Gautam Prasad, Mimi Sun, Tomer Shekel, Joydeep Paul, Anna Carter, Shravya Shetty, Dylan Green

Main category: cs.LG

TL;DR: GeoFM embeddings improve prediction of health outcomes in LMICs compared to traditional methods, with multi-source integration providing the most robust results.

DetailsMotivation: Routine health data in LMICs suffers from reporting delays and incomplete coverage, requiring novel data sources and analytics to improve reliability.

Method: Evaluated three GeoFM embedding sources (Google PDFM, AlphaEarth, mobile CDR) using XGBoost models on 552 health catchment areas in Malawi, comparing to traditional geospatial interpolation methods with 80/20 train-test split and 5-fold CV.

Result: Embedding-based approaches improved upon baseline methods in 13 of 15 (87%) indicators. Multi-GeoFM model achieved best performance with R2 values of 0.63-0.68 for key indicators like population density, HIV cases, and vaccinations.

Conclusion: Integration of multiple GeoFM sources is an efficient and valuable tool for supplementing and strengthening constrained routine health information systems in LMICs.

Abstract: The reliability of routine health data in low and middle-income countries (LMICs) is often constrained by reporting delays and incomplete coverage, necessitating the exploration of novel data sources and analytics. Geospatial Foundation Models (GeoFMs) offer a promising avenue by synthesizing diverse spatial, temporal, and behavioral data into mathematical embeddings that can be efficiently used for downstream prediction tasks. This study evaluated the predictive performance of three GeoFM embedding sources - Google Population Dynamics Foundation Model (PDFM), Google AlphaEarth (derived from satellite imagery), and mobile phone call detail records (CDR) - for modeling 15 routine health programmatic outputs in Malawi, and compared their utility to traditional geospatial interpolation methods. We used XGBoost models on data from 552 health catchment areas (January 2021-May 2023), assessing performance with R2 on an 80/20 train-test split, with 5-fold cross-validation used during training. While predictive performance was mixed, the embedding-based approaches improved upon baseline geostatistical methods in 13 of 15 (87%) indicators tested. A Multi-GeoFM model integrating all three embedding sources produced the most robust predictions, achieving average 5-fold cross-validated R2 values for indicators like population density (0.63), new HIV cases (0.57), and child vaccinations (0.47), and test set R2 of 0.64, 0.68, and 0.55, respectively. Prediction was poor for targets with low primary data availability, such as TB and malnutrition cases. These results demonstrate that GeoFM embeddings provide a modest predictive improvement for select health and demographic outcomes in an LMIC context. We conclude that the integration of multiple GeoFM sources is an efficient and valuable tool for supplementing and strengthening constrained routine health information systems.
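
The evaluation protocol maps to a few lines of standard tooling; in the sketch below, the arrays are random stand-ins for the concatenated embedding features and one programmatic indicator.

```python
# Schematic of the reported protocol: XGBoost regressor, 80/20 split,
# 5-fold CV on the training portion, R^2 scoring (stand-in data).
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from xgboost import XGBRegressor

X = np.random.randn(552, 128)     # e.g. PDFM + AlphaEarth + CDR embeddings
y = np.random.rand(552)           # one health indicator per catchment area
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBRegressor(n_estimators=300, max_depth=4)
cv_r2 = cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2")
test_r2 = model.fit(X_tr, y_tr).score(X_te, y_te)
```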

[335] On the Dataless Training of Neural Networks

Alvaro Velasquez, Susmit Jha, Ismail R. Alkhouri

Main category: cs.LG

TL;DR: Survey of neural networks for optimization without training data, categorizing dataless neural network methods into architecture-agnostic and architecture-specific approaches.

DetailsMotivation: Two key factors: data-driven learning approaches are underdeveloped in areas like combinatorial optimization, and training data is inherently limited in applications like medical image reconstruction.

Method: Re-parameterizing optimization problems using neural network architectures (MLP, convolutional, graph, quadratic) in dataless settings, categorized into architecture-agnostic and architecture-specific methods.

Result: The approach has gained recent attention due to promising results across diverse applications including combinatorial optimization, inverse problems, and partial differential equations.

Conclusion: Defines dataless neural network setting and clarifies distinctions from related concepts like zero-shot learning, lifting in optimization, and over-parameterization.

Abstract: This paper surveys studies on the use of neural networks for optimization in the training-data-free setting. Specifically, we examine the dataless application of neural network architectures in optimization by re-parameterizing problems using fully connected (or MLP), convolutional, graph, and quadratic neural networks. Although MLPs were used to solve linear programs a few decades ago, this approach has recently gained increasing attention due to its promising results across diverse applications, including those based on combinatorial optimization, inverse problems, and partial differential equations. The motivation for this setting stems from two key (possibly overlapping) factors: (i) data-driven learning approaches are still underdeveloped and have yet to demonstrate strong results, as seen in combinatorial optimization, and (ii) the availability of training data is inherently limited, such as in medical image reconstruction and other scientific applications. In this paper, we define the dataless setting and categorize it into two variants based on how a problem instance – defined by a single datum – is encoded onto the neural network: (i) architecture-agnostic methods and (ii) architecture-specific methods. Additionally, we discuss similarities and clarify distinctions between the dataless neural network (dNN) settings and related concepts such as zero-shot learning, one-shot learning, lifting in optimization, and over-parameterization.
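
A minimal dataless-NN example in the spirit the survey describes: a single MaxCut instance is encoded directly in a differentiable loss, and the trainable parameters are the (soft) solution itself, with no training data anywhere. The instance and hyperparameters are toys.

```python
# Dataless optimization of a single MaxCut instance: sigmoid(theta) is a soft
# node assignment; the expected cut p_i + p_j - 2*p_i*p_j is maximized directly.
import torch

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # toy graph instance
theta = torch.zeros(4, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.1)

for _ in range(300):
    p = torch.sigmoid(theta)                        # soft assignment in [0, 1]
    cut = sum(p[i] + p[j] - 2 * p[i] * p[j] for i, j in edges)
    loss = -cut                                     # maximize expected cut size
    opt.zero_grad(); loss.backward(); opt.step()

assignment = (torch.sigmoid(theta) > 0.5).int()    # round to a hard cut
```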

[336] Contrastive Predictive Coding Done Right for Mutual Information Estimation

J. Jon Ryu, Pavan Yeddanapudi, Xiangxiang Xu, Gregory W. Wornell

Main category: cs.LG

TL;DR: InfoNCE is not a valid mutual information estimator. The paper introduces InfoNCE-anchor, a modified version that reduces bias for accurate MI estimation, and generalizes the framework using proper scoring rules.

DetailsMotivation: InfoNCE has been widely used for mutual information estimation despite its indirect connection to MI, and the authors aim to demonstrate why it's not a valid MI estimator and provide a better alternative.

Method: Introduces InfoNCE-anchor with an auxiliary anchor class for consistent density ratio estimation, and generalizes the framework using proper scoring rules that unify various contrastive objectives.

Result: InfoNCE-anchor with log score achieves the most accurate MI estimates, but in self-supervised learning, the anchor doesn’t improve downstream task performance, showing that contrastive learning benefits from learning structured density ratios rather than accurate MI estimation.

Conclusion: Contrastive representation learning benefits from learning structured density ratios, not from accurate mutual information estimation itself, as demonstrated by the improved MI estimation but unchanged downstream performance with InfoNCE-anchor.

Abstract: The InfoNCE objective, originally introduced for contrastive representation learning, has become a popular choice for mutual information (MI) estimation, despite its indirect connection to MI. In this paper, we demonstrate why InfoNCE should not be regarded as a valid MI estimator, and we introduce a simple modification, which we refer to as InfoNCE-anchor, for accurate MI estimation. Our modification introduces an auxiliary anchor class, enabling consistent density ratio estimation and yielding a plug-in MI estimator with significantly reduced bias. Beyond this, we generalize our framework using proper scoring rules, which recover InfoNCE-anchor as a special case when the log score is employed. This formulation unifies a broad spectrum of contrastive objectives, including NCE, InfoNCE, and $f$-divergence variants, under a single principled framework. Empirically, we find that InfoNCE-anchor with the log score achieves the most accurate MI estimates; however, in self-supervised representation learning experiments, we find that the anchor does not improve the downstream task performance. These findings corroborate that contrastive representation learning benefits not from accurate MI estimation per se, but from the learning of structured density ratios.
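
For reference, the baseline InfoNCE lower bound that the paper critiques is a one-liner; the anchor modification is specified in the paper itself, so only the standard objective is sketched here.

```python
# Standard InfoNCE lower bound on mutual information: log K plus the mean
# log-softmax of positive-pair scores against K candidates.
import math
import torch

def infonce_lower_bound(scores: torch.Tensor) -> torch.Tensor:
    """scores[i, j] = critic f(x_i, y_j); positive pairs on the diagonal."""
    K = scores.shape[0]
    return math.log(K) + (scores.diag() - scores.logsumexp(dim=1)).mean()

est = infonce_lower_bound(torch.randn(64, 64))   # saturates at log K for high MI
```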

[337] A General and Streamlined Differentiable Optimization Framework

Andrew W. Rosemberg, Joaquim Dias Garcia, François Pacaud, Robert B. Parker, Benoît Legat, Kaarthik Sundar, Russell Bent, Pascal Van Hentenryck

Main category: cs.LG

TL;DR: DiffOpt.jl is a Julia framework that enables automatic differentiation through constrained optimization problems, providing forward- and reverse-mode sensitivities for smooth programs by differentiating KKT systems.

DetailsMotivation: There's a growing need to differentiate through optimization problems for learning, control, and decision-making systems, but practical integration remains challenging due to solver specialization and interface mismatches.

Method: The framework computes solution and objective sensitivities by differentiating the KKT system under standard regularity assumptions, using a parameter-centric API that allows declaring named parameters and obtaining derivatives directly with respect to them.

Result: The framework successfully handles convex and nonconvex models including economic dispatch, portfolio selection, and robot inverse kinematics, and enables gradient-based methods for energy market bidding and training optimization proxies.

Conclusion: Differentiable optimization can be deployed as a routine tool for experimentation, learning, calibration, and design while maintaining standard JuMP modeling practices and access to solver ecosystems.

Abstract: Differentiating through constrained optimization problems is increasingly central to learning, control, and large-scale decision-making systems, yet practical integration remains challenging due to solver specialization and interface mismatches. This paper presents a general and streamlined framework, an updated DiffOpt.jl, that unifies modeling and differentiation within the Julia optimization stack. The framework computes forward- and reverse-mode solution and objective sensitivities for smooth, potentially nonconvex programs by differentiating the KKT system under standard regularity assumptions. A first-class, JuMP-native parameter-centric API allows users to declare named parameters and obtain derivatives directly with respect to them - even when a parameter appears in multiple constraints and objectives - eliminating brittle bookkeeping from coefficient-level interfaces. We illustrate these capabilities on convex and nonconvex models, including economic dispatch, mean-variance portfolio selection with conic risk constraints, and nonlinear robot inverse kinematics. Two companion studies further demonstrate impact at scale: gradient-based iterative methods for strategic bidding in energy markets and Sobolev-style training of end-to-end optimization proxies using solver-accurate sensitivities. Together, these results demonstrate that differentiable optimization can be deployed as a routine tool for experimentation, learning, calibration, and design, without deviating from standard JuMP modeling practices and while retaining access to a broad ecosystem of solvers.
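
The core computation, differentiating the KKT system, is an implicit-function-theorem argument. A compact statement of the generic pattern (not DiffOpt.jl-specific notation):

$$F\bigl(z^\star(\theta), \theta\bigr) = 0 \;\;\Longrightarrow\;\; \frac{\partial z^\star}{\partial \theta} = -\left[\frac{\partial F}{\partial z}\right]^{-1} \frac{\partial F}{\partial \theta},$$

where $z^\star$ stacks the primal-dual solution and $F$ collects the stationarity, primal feasibility, and complementary-slackness residuals. Forward and reverse modes correspond to multiplying this relation by a tangent or an adjoint vector, avoiding the full Jacobian.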

[338] Efficient Online Learning with Predictive Coding Networks: Exploiting Temporal Correlations

Darius Masoum Zadeh-Jousdani, Elvin Hajizada, Eyke Hüllermeier

Main category: cs.LG

TL;DR: PCN-TA is a biologically plausible predictive coding network that reduces computational overhead by preserving latent states across temporal frames, achieving 50% fewer inference steps than baseline PC and 10% fewer weight updates than backpropagation.

DetailsMotivation: Need for efficient online learning algorithms for robotic edge systems that can continuously adapt to changing environments, with biologically plausible alternatives to traditional backpropagation that are suitable for neuromorphic hardware.

Method: Predictive Coding Network with Temporal Amortization (PCN-TA) that preserves latent states across temporal frames to leverage temporal correlations and reduce computational demands while maintaining learning performance.

Result: On COIL-20 robotic perception dataset, PCN-TA achieves 10% fewer weight updates compared to backpropagation and requires 50% fewer inference steps than baseline PC networks, with reduced computational overhead.

Conclusion: PCN-TA enables efficient online learning at the edge with reduced computational demands, making it suitable for resource-constrained robotic systems and promising for future neuromorphic hardware implementations.

Abstract: Robotic systems operating at the edge require efficient online learning algorithms that can continuously adapt to changing environments while processing streaming sensory data. Traditional backpropagation, while effective, conflicts with biological plausibility principles and may be suboptimal for continuous adaptation scenarios. The Predictive Coding (PC) framework offers a biologically plausible alternative with local, Hebbian-like update rules, making it suitable for neuromorphic hardware implementation. However, PC’s main limitation is its computational overhead due to multiple inference iterations during training. We present Predictive Coding Network with Temporal Amortization (PCN-TA), which preserves latent states across temporal frames. By leveraging temporal correlations, PCN-TA significantly reduces computational demands while maintaining learning performance. Our experiments on the COIL-20 robotic perception dataset demonstrate that PCN-TA achieves 10% fewer weight updates compared to backpropagation and requires 50% fewer inference steps than baseline PC networks. These efficiency gains directly translate to reduced computational overhead, moving another step toward edge deployment and real-time adaptation support in resource-constrained robotic systems. The biologically-inspired nature of our approach also makes it a promising candidate for future neuromorphic hardware implementations, enabling efficient online learning at the edge.

[339] Infrequent Exploration in Linear Bandits

Harin Lee, Min-hwan Oh

Main category: cs.LG

TL;DR: INFEX is a framework for infrequent exploration in linear bandits that balances between continuous exploration and purely greedy approaches, achieving efficient regret with reduced exploration frequency.

DetailsMotivation: Addresses the gap between fully adaptive exploration methods (which explore too frequently) and purely greedy approaches (which fail without diversity), particularly for safety-critical or costly domains where continuous exploration is impractical.

Method: INFEX executes a base exploratory policy according to a schedule while predominantly choosing greedy actions in between, providing a modular framework that can integrate any adaptive exploration method.

Result: Theoretical analysis shows INFEX achieves instance-dependent regret matching standard efficient algorithms when exploration frequency exceeds a logarithmic threshold. Empirical evaluations confirm state-of-the-art regret performance and runtime improvements.

Conclusion: INFEX provides a practical, general framework for infrequent exploration that maintains theoretical guarantees while offering computational efficiency and wide applicability across domains where frequent exploration is undesirable.

Abstract: We study the problem of infrequent exploration in linear bandits, addressing a significant yet overlooked gap between fully adaptive exploratory methods (e.g., UCB and Thompson Sampling), which explore potentially at every time step, and purely greedy approaches, which require stringent diversity assumptions to succeed. Continuous exploration can be impractical or unethical in safety-critical or costly domains, while purely greedy strategies typically fail without adequate contextual diversity. To bridge these extremes, we introduce a simple and practical framework, INFEX, explicitly designed for infrequent exploration. INFEX executes a base exploratory policy according to a given schedule while predominantly choosing greedy actions in between. Despite its simplicity, our theoretical analysis demonstrates that INFEX achieves instance-dependent regret matching standard provably efficient algorithms, provided the exploration frequency exceeds a logarithmic threshold. Additionally, INFEX is a general, modular framework that allows seamless integration of any fully adaptive exploration method, enabling wide applicability and ease of adoption. By restricting intensive exploratory computations to infrequent intervals, our approach can also enhance computational efficiency. Empirical evaluations confirm our theoretical findings, showing state-of-the-art regret performance and runtime improvements over existing methods.
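
The INFEX pattern is easy to sketch: explore with the base policy only on scheduled rounds and act greedily otherwise. The geometric schedule below is one plausible way to exceed the logarithmic exploration threshold the theory requires; the framework itself takes the schedule as a given input.

```python
# INFEX-style interaction: scheduled exploration, greedy everywhere else.
import math

def make_schedule(T: int, c: float = 2.0) -> set:
    """~c*log^2(T) geometrically spaced exploration rounds (one plausible
    super-logarithmic choice; any base exploratory schedule could be used)."""
    n = max(1, int(c * math.log(T) ** 2))
    return {max(1, int(T ** (k / n))) for k in range(1, n + 1)}

def infex_step(t, schedule, explore, greedy):
    """Base exploratory policy on scheduled rounds, greedy action otherwise."""
    return explore(t) if t in schedule else greedy(t)

schedule = make_schedule(10_000)
actions = [infex_step(t, schedule, explore=lambda t: "explore",
                      greedy=lambda t: "greedy") for t in range(1, 100)]
```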

[340] Dual Mixture-of-Experts Framework for Discrete-Time Survival Analysis

Hyeonjun Lee, Hyungseob Shin, Gunhee Nam, Hyeonsoo Lee

Main category: cs.LG

TL;DR: A dual mixture-of-experts framework for discrete-time survival analysis that combines feature-encoder MoE for subgroup-aware representation learning with hazard MoE for temporal dynamics, improving performance on breast cancer datasets.

DetailsMotivation: To address the challenge of modeling patient heterogeneity while adapting risk predictions to individual characteristics and temporal dynamics in survival analysis.

Method: Proposes a dual mixture-of-experts framework with feature-encoder MoE for subgroup-aware representation learning and hazard MoE that leverages patient features and time embeddings to capture temporal dynamics.

Result: Consistently improves performance on METABRIC and GBSG breast cancer datasets, boosting time-dependent C-index up to 0.04 on test sets, with further gains when incorporated into Consurv framework.

Conclusion: The dual-MoE design effectively integrates with existing deep learning survival pipelines and provides flexible modeling of patient heterogeneity and temporal dynamics in survival analysis.

Abstract: Survival analysis is a task to model the time until an event of interest occurs, widely used in clinical and biomedical research. A key challenge is to model patient heterogeneity while also adapting risk predictions to both individual characteristics and temporal dynamics. We propose a dual mixture-of-experts (MoE) framework for discrete-time survival analysis. Our approach combines a feature-encoder MoE for subgroup-aware representation learning with a hazard MoE that leverages patient features and time embeddings to capture temporal dynamics. This dual-MoE design flexibly integrates with existing deep learning based survival pipelines. On METABRIC and GBSG breast cancer datasets, our method consistently improves performance, boosting the time-dependent C-index up to 0.04 on the test sets, and yields further gains when incorporated into the Consurv framework.
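
For readers unfamiliar with the discrete-time setup, the standard quantities the hazard network predicts are (with $t_1 < t_2 < \dots$ the time bins):

$$h_k(x) = P\bigl(T = t_k \mid T \ge t_k, x\bigr), \qquad S(t_k \mid x) = \prod_{j=1}^{k} \bigl(1 - h_j(x)\bigr).$$

In the proposed framework, $h_k(x)$ would be produced by the hazard MoE from the patient features and the time embedding for bin $k$; the formulas themselves are the standard discrete-time definitions, with the MoE placement inferred from the summary above.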

[341] Exploring Human-AI Conceptual Alignment through the Prism of Chess

Semyon Lomaso, Judah Goldfeder, Mehmet Hamza Erol, Matthew So, Yao Yan, Addison Howard, Nathan Kutz, Ravid Shwartz Ziv

Main category: cs.LG

TL;DR: AI chess models achieve grandmaster-level play but show a paradox: early layers encode human chess concepts well (85% accuracy), while deeper layers that drive better performance drift to alien representations (50-65% accuracy). When tested on Chess960 (random starting positions), concept recognition drops 10-20%, revealing reliance on memorized patterns rather than abstract understanding.

DetailsMotivation: To investigate whether AI systems truly understand human concepts or merely mimic surface patterns, using chess as a domain where human creativity meets precise strategic concepts.

Method: Analyzed a 270M-parameter transformer chess model, conducted layer-wise analysis of concept encoding, and introduced the first Chess960 dataset with 240 expert-annotated positions across 6 strategic concepts to test robustness beyond memorization.

Result: Early layers encoded human concepts like center control and knight outposts with 85% accuracy, but deeper layers dropped to 50-65% accuracy despite superior performance. On Chess960, concept recognition dropped 10-20%, showing reliance on memorized patterns rather than abstract understanding.

Conclusion: Current AI architectures face a fundamental tension: representations that win games diverge from those that align with human thinking. As AI systems optimize for performance, they develop increasingly alien intelligence, posing challenges for creative AI applications requiring genuine human-AI collaboration.

Abstract: Do AI systems truly understand human concepts or merely mimic surface patterns? We investigate this through chess, where human creativity meets precise strategic concepts. Analyzing a 270M-parameter transformer that achieves grandmaster-level play, we uncover a striking paradox: while early layers encode human concepts like center control and knight outposts with up to 85% accuracy, deeper layers, despite driving superior performance, drift toward alien representations, dropping to 50-65% accuracy. To test conceptual robustness beyond memorization, we introduce the first Chess960 dataset: 240 expert-annotated positions across 6 strategic concepts. When opening theory is eliminated through randomized starting positions, concept recognition drops 10-20% across all methods, revealing the model’s reliance on memorized patterns rather than abstract understanding. Our layer-wise analysis exposes a fundamental tension in current architectures: the representations that win games diverge from those that align with human thinking. These findings suggest that as AI systems optimize for performance, they develop increasingly alien intelligence, a critical challenge for creative AI applications requiring genuine human-AI collaboration. Dataset and code are available at: https://github.com/slomasov/ChessConceptsLLM.

[342] Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods

Jiali Cheng, Chirag Agarwal, Hadi Amiri

Main category: cs.LG

TL;DR: Knowledge distillation undermines model debiasing capabilities, with significant variations across bias types, but three solutions are proposed to improve debiasing distillability.

DetailsMotivation: To investigate how knowledge distillation affects model robustness against spurious correlations and debiasing capabilities, which remains underexplored.

Method: Extensive experiments on natural language inference and image classification tasks, analyzing attention patterns and internal circuits post-distillation.

Result: KD undermines debiasing capabilities; debiased models don’t benefit from teacher knowledge; robustness varies across bias types; identified specific attention patterns causing distinct behavior.

Conclusion: First study on KD’s effect on debiasing, providing insights for designing better debiasing methods through data augmentation, iterative KD, and weight initialization.

Abstract: Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on a model’s robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of “debiasing” capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall, the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pinpoint the internal attention pattern and circuit that causes the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its internal mechanism at scale. Our findings provide understanding of how KD works and how to design better debiasing methods.

[343] Towards Scaling Laws for Symbolic Regression

David Otte, Jörg K. H. Franke, Frank Hutter

Main category: cs.LG

TL;DR: This paper presents the first systematic investigation of scaling laws in symbolic regression, showing that both validation loss and solved rate follow clear power-law trends with compute, similar to scaling laws in language modeling.

DetailsMotivation: Symbolic regression aims to discover mathematical expressions from data for scientific insight and interpretable models. While deep learning-based SR has become competitive with genetic programming, the role of scale remained largely unexplored.

Method: Used a scalable end-to-end transformer pipeline with carefully generated training data, testing across five different model sizes spanning three orders of magnitude in compute to study scaling behavior.

Result: Found that validation loss and solved rate follow power-law trends with compute, identified compute-optimal hyperparameter scaling (batch size and learning rate grow with model size), and determined optimal token-to-parameter ratio of ≈15.

Conclusion: Symbolic regression performance is largely predictable from compute, providing important insights for training next-generation SR models and demonstrating the applicability of scaling laws to this domain.

Abstract: Symbolic regression (SR) aims to discover the underlying mathematical expressions that explain observed data. This holds promise for both gaining scientific insight and for producing inherently interpretable and generalizable models for tabular data. In this work we focus on the basics of SR. Deep learning-based SR has recently become competitive with genetic programming approaches, but the role of scale has remained largely unexplored. Inspired by scaling laws in language modeling, we present the first systematic investigation of scaling in SR, using a scalable end-to-end transformer pipeline and carefully generated training data. Across five different model sizes and spanning three orders of magnitude in compute, we find that both validation loss and solved rate follow clear power-law trends with compute. We further identify compute-optimal hyperparameter scaling: optimal batch size and learning rate grow with model size, and a token-to-parameter ratio of $\approx$15 is optimal in our regime, with a slight upward trend as compute increases. These results demonstrate that SR performance is largely predictable from compute and offer important insights for training the next generation of SR models.
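
Power-law trends of the form $L(C) = aC^{-b}$ are straight lines on log-log axes, so fitting the reported trend takes one polyfit; the data points below are made up for illustration, not the paper's measurements.

```python
# Fitting a scaling law L(C) = a * C^b (b < 0) by linear regression in log-log
# space; compute values and losses here are illustrative placeholders.
import numpy as np

compute = np.array([1e17, 1e18, 1e19, 1e20])         # FLOPs (made up)
val_loss = np.array([0.52, 0.41, 0.33, 0.26])        # losses (made up)
b, log_a = np.polyfit(np.log(compute), np.log(val_loss), 1)
print(f"L(C) ≈ {np.exp(log_a):.3g} * C^{b:+.3f}")    # slope b is negative
```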

[344] Learning Geometry: A Framework for Building Adaptive Manifold Models through Metric Optimization

Di Zhang

Main category: cs.LG

TL;DR: A novel machine learning paradigm that treats models as malleable geometric entities by optimizing metric tensor fields on manifolds, enabling dynamic geometric structure shaping beyond traditional parameter optimization.

DetailsMotivation: To move beyond traditional parameter optimization by treating models as geometric entities, allowing for more expressive power and dynamic adaptation through geometric structure optimization.

Method: Constructs a variational framework with loss balancing data fidelity and geometric complexity, discretizes continuous manifolds into triangular meshes, parameterizes metric tensor by edge lengths, and uses automatic differentiation for efficient optimization.

Result: Developed a practical computational method for infinite-dimensional optimization, established theoretical analogy with Einstein-Hilbert action in general relativity, and demonstrated greater expressive power than fixed-geometry models.

Conclusion: This work provides foundation for dynamic meta-learners that can autonomously evolve geometry and topology, with broad applications in scientific model discovery and robust representation learning.

Abstract: This paper proposes a novel paradigm for machine learning that moves beyond traditional parameter optimization. Unlike conventional approaches that search for optimal parameters within a fixed geometric space, our core idea is to treat the model itself as a malleable geometric entity. Specifically, we optimize the metric tensor field on a manifold with a predefined topology, thereby dynamically shaping the geometric structure of the model space. To achieve this, we construct a variational framework whose loss function carefully balances data fidelity against the intrinsic geometric complexity of the manifold. The former ensures the model effectively explains observed data, while the latter acts as a regularizer, penalizing overly curved or irregular geometries to encourage simpler models and prevent overfitting. To address the computational challenges of this infinite-dimensional optimization problem, we introduce a practical method based on discrete differential geometry: the continuous manifold is discretized into a triangular mesh, and the metric tensor is parameterized by edge lengths, enabling efficient optimization using automatic differentiation tools. Theoretical analysis reveals a profound analogy between our framework and the Einstein-Hilbert action in general relativity, providing an elegant physical interpretation for the concept of “data-driven geometry”. We further argue that even with fixed topology, metric optimization offers significantly greater expressive power than models with fixed geometry. This work lays a solid foundation for constructing fully dynamic “meta-learners” capable of autonomously evolving their geometry and topology, and it points to broad application prospects in areas such as scientific model discovery and robust representation learning.
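
As a concrete reading of the variational framework (the exact regularizer is not pinned down in the abstract, so this is one plausible instantiation), the loss balances data fidelity against an integrated curvature penalty over the manifold $(\mathcal{M}, g)$:

$$\mathcal{L}[g] \;=\; \sum_{i} \ell\bigl(f_g(x_i), y_i\bigr) \;+\; \lambda \int_{\mathcal{M}} R_g^{2}\, dA_g,$$

where $R_g$ is the scalar curvature of the optimized metric $g$. The stated analogy with the Einstein-Hilbert action $\int R \sqrt{|g|}\, d^n x$ comes from this curvature integral, and in the discretized version $g$ is parameterized by the edge lengths of the triangular mesh.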

[345] New Money: A Systematic Review of Synthetic Data Generation for Finance

James Meldrum, Basem Suleiman, Fethi Rabhi, Muhammad Johan Alibasa

Main category: cs.LG

TL;DR: Systematic review of 72 studies on synthetic financial data generation using GANs and VAEs, finding GANs dominate for time-series market data and tabular credit data, with gaps in privacy evaluation.

DetailsMotivation: Address challenges of using sensitive financial data in ML by creating artificial datasets that preserve statistical properties while mitigating privacy risks and regulatory constraints.

Method: Systematic review and analysis of 72 studies published since 2018, categorizing types of financial information synthesized, generative methods employed, and evaluation strategies.

Result: GAN-based approaches dominate literature, particularly for time-series market data and tabular credit data. Innovative techniques show potential but lack rigorous privacy evaluation.

Conclusion: Highlights critical research gaps and offers guidance for future work on robust, privacy-preserving synthetic data solutions for financial domain.

Abstract: Synthetic data generation has emerged as a promising approach to address the challenges of using sensitive financial data in machine learning applications. By leveraging generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), it is possible to create artificial datasets that preserve the statistical properties of real financial records while mitigating privacy risks and regulatory constraints. Despite the rapid growth of this field, a comprehensive synthesis of the current research landscape has been lacking. This systematic review consolidates and analyses 72 studies published since 2018 that focus on synthetic financial data generation. We categorise the types of financial information synthesised, the generative methods employed, and the evaluation strategies used to assess data utility and privacy. The findings indicate that GAN-based approaches dominate the literature, particularly for generating time-series market data and tabular credit data. While several innovative techniques demonstrate potential for improved realism and privacy preservation, there remains a notable lack of rigorous evaluation of privacy safeguards across studies. By providing an integrated overview of generative techniques, applications, and evaluation methods, this review highlights critical research gaps and offers guidance for future work aimed at developing robust, privacy-preserving synthetic data solutions for the financial domain.

[346] Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

Yuhua Jiang, Shuang Cheng, Yihao Liu, Ermo Hua, Che Jiang, Weigao Sun, Yu Cheng, Feifei Gao, Biqing Qi, Bowen Zhou

Main category: cs.LG

TL;DR: Nirvana is a Specialized Generalist Model with specialized memory mechanism, linear time complexity, and test-time task information extraction that achieves competitive performance on both general language tasks and specialized medical tasks.

DetailsMotivation: Traditional LLM structures lack specialized memory mechanisms guided by task information, limiting their ability to achieve expert-level performance in target domains while preserving broad capabilities.

Method: Proposes Task-Aware Memory Trigger that adjusts memory mechanism based on task requirements, treating each sample as self-supervised fine-tuning, and Specialized Memory Updater that dynamically memorizes context guided by Trigger.

Result: Achieves competitive/superior results on natural language benchmarks and higher-quality MRI reconstruction compared to conventional models and traditional LLM backbones, with accurate clinical report generation.

Conclusion: Nirvana’s specialized memory mechanism enables effective adaptation to domain shifts and achieves expert-level performance in specialized tasks while maintaining general capabilities.

Abstract: Specialized Generalist Models (SGMs) aim to preserve broad capabilities while achieving expert-level performance in target domains. However, traditional LLM structures, including Transformer, Linear Attention, and hybrid models, do not employ a specialized memory mechanism guided by task information. In this paper, we present Nirvana, an SGM with a specialized memory mechanism, linear time complexity, and test-time task information extraction. In addition, we propose the Task-Aware Memory Trigger ($\textit{Trigger}$), which flexibly adjusts the memory mechanism based on the current task’s requirements. In Trigger, each incoming sample is treated as a self-supervised fine-tuning task, enabling Nirvana to adapt its task-related parameters on the fly to domain shifts. We also design the Specialized Memory Updater ($\textit{Updater}$), which dynamically memorizes the context guided by Trigger. We conduct experiments on both general language tasks and specialized medical tasks. On a variety of natural language modeling benchmarks, Nirvana achieves competitive or superior results compared to existing LLM structures. To prove the effectiveness of Trigger on specialized tasks, we test Nirvana’s performance on a challenging medical task, i.e., Magnetic Resonance Imaging (MRI). We post-train the frozen Nirvana backbone with lightweight codecs on paired electromagnetic signals and MRI images. Despite the frozen backbone, Trigger guides the model to adapt to the MRI domain by changing its task-related parameters. Nirvana achieves higher-quality MRI reconstruction than conventional MRI models as well as models with traditional LLM backbones, and can also generate accurate preliminary clinical reports.
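
As a rough illustration of the test-time mechanism described above, the sketch below treats each incoming sample as a one-step self-supervised fine-tuning task that updates only a small set of task-related parameters while the backbone stays frozen. All names here are hypothetical stand-ins; the paper's actual Trigger and Updater operate inside a linear-attention memory rather than a toy adapter.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)     # general-purpose weights (frozen)
        self.task_params = nn.Linear(dim, dim)  # small task-related adapter

    def forward(self, x):
        return self.task_params(torch.relu(self.backbone(x)))

model = TinyModel()
for p in model.backbone.parameters():
    p.requires_grad_(False)
opt = torch.optim.SGD(model.task_params.parameters(), lr=1e-2)

def self_supervised_loss(model, x):
    # Reconstruction stands in for the paper's self-supervised objective.
    return ((model(x) - x) ** 2).mean()

x = torch.randn(4, 32)                 # one incoming test-time sample/batch
loss = self_supervised_loss(model, x)
opt.zero_grad(); loss.backward(); opt.step()   # adapt task params on the fly
pred = model(x)                        # prediction after adaptation
```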

[347] LLMBisect: Breaking Barriers in Bug Bisection with A Comparative Analysis Pipeline

Zheng Zhang, Haonan Li, Xingyu Li, Hang Zhang, Zhiyun Qian

Main category: cs.LG

TL;DR: A multi-stage LLM pipeline for bug bisection that significantly outperforms traditional methods by comprehensively analyzing patches and commit messages.

DetailsMotivation: Traditional patch-based bisection methods have limitations: they assume bug-inducing and patch commits modify same functions, ignore commit message information, and rely on simple heuristics without logical vulnerability analysis.

Method: Proposes a comprehensive multi-stage pipeline using LLMs to: (1) fully utilize patch information, (2) compare multiple candidate commits in context, and (3) progressively narrow down candidates through down-selection steps.

Result: Achieves 38% better accuracy than state-of-the-art solution and 60% improvement over baseline LLM-based bisection method.

Conclusion: LLMs are well-positioned to break barriers of existing solutions by comprehending both textual data and code, and the multi-stage pipeline is essential for accurate bug-inducing commit identification.

Abstract: Bug bisection has been an important security task that aims to understand the range of software versions impacted by a bug, i.e., identifying the commit that introduced the bug. However, traditional patch-based bisection methods are faced with several significant barriers: For example, they assume that the bug-inducing commit (BIC) and the patch commit modify the same functions, which is not always true. They often rely solely on code changes, while the commit message frequently contains a wealth of vulnerability-related information. They are also based on simple heuristics (e.g., assuming the BIC initializes lines deleted in the patch) and lack any logical analysis of the vulnerability. In this paper, we make the observation that Large Language Models (LLMs) are well-positioned to break the barriers of existing solutions, e.g., comprehend both textual data and code in patches and commits. Unlike previous BIC identification approaches, which yield poor results, we propose a comprehensive multi-stage pipeline that leverages LLMs to: (1) fully utilize patch information, (2) compare multiple candidate commits in context, and (3) progressively narrow down the candidates through a series of down-selection steps. In our evaluation, we demonstrate that our approach achieves significantly better accuracy than the state-of-the-art solution by more than 38%. Our results further confirm that the comprehensive multi-stage pipeline is essential, as it improves accuracy by 60% over a baseline LLM-based bisection method.
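
The progressive down-selection step can be pictured as a simple loop: score every candidate commit against the patch with an LLM, keep the top fraction, and repeat until one commit remains. `llm_score` below is a hypothetical placeholder for a real LLM call; the paper's pipeline uses richer comparative prompts over both patch code and commit messages.

```python
def llm_score(patch: str, candidate: str) -> float:
    # Hypothetical stand-in: ask an LLM how plausible it is that `candidate`
    # introduced the vulnerability fixed by `patch`; return a score in [0, 1].
    return 0.5

def bisect_candidates(patch: str, candidates: list[str], keep_ratio: float = 0.5) -> str:
    pool = list(candidates)
    while len(pool) > 1:
        # Rank the pool in context and keep the top fraction each round.
        ranked = sorted(pool, key=lambda c: llm_score(patch, c), reverse=True)
        pool = ranked[: max(1, int(len(ranked) * keep_ratio))]
    return pool[0]   # best guess at the bug-inducing commit
```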

[348] SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth

Nick Masi, Randall Balestriero

Main category: cs.LG

TL;DR: SAFE is a package for evaluating weather forecast models using stratified assessments across different geospatial attributes like countries, income levels, and landcover, revealing performance disparities that global averages miss.

DetailsMotivation: Current ML evaluation averages performance globally, ignoring non-uniform human development and geography, which masks important performance disparities across different regions and populations.

Method: Developed SAFE package that integrates various data domains to stratify geospatial gridpoints by territory, global subregion, income, and landcover, enabling individual stratum performance analysis.

Result: Benchmarked state-of-the-art AI weather models and found they all exhibit forecasting skill disparities across every attribute, enabling creation of a benchmark for model forecast fairness.

Conclusion: SAFE enables moving beyond globally-averaged metrics to identify where models perform best/worst and which are most fair, providing open-source tools for stratified evaluation in weather and climate applications.

Abstract: The dominant paradigm in machine learning is to assess model performance based on average loss across all samples in some test set. This amounts to averaging performance geospatially across the Earth in weather and climate settings, failing to account for the non-uniform distribution of human development and geography. We introduce Stratified Assessments of Forecasts over Earth (SAFE), a package for elucidating the stratified performance of a set of predictions made over Earth. SAFE integrates various data domains to stratify by different attributes associated with geospatial gridpoints: territory (usually country), global subregion, income, and landcover (land or water). This allows us to examine the performance of models for each individual stratum of the different attributes (e.g., the accuracy in every individual country). To demonstrate its importance, we utilize SAFE to benchmark a zoo of state-of-the-art AI-based weather prediction models, finding that they all exhibit disparities in forecasting skill across every attribute. We use this to seed a benchmark of model forecast fairness through stratification at different lead times for various climatic variables. By moving beyond globally-averaged metrics, we for the first time ask: where do models perform best or worst, and which models are most fair? To support further work in this direction, the SAFE package is open source and available at https://github.com/N-Masi/safe
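
A minimal sketch of the stratified-evaluation idea, assuming gridpoint errors have already been joined to an attribute such as country: compute the metric per stratum instead of as one global average. Column names here are illustrative; SAFE itself joins gridpoints to territory, subregion, income, and landcover attributes.

```python
import pandas as pd

# Toy gridpoint table with forecasts, observations, and a stratifying attribute.
df = pd.DataFrame({
    "country":  ["A", "A", "B", "B", "C"],
    "forecast": [1.0, 2.0, 3.5, 2.0, 0.5],
    "observed": [1.2, 1.8, 3.0, 2.5, 1.5],
})
df["sq_err"] = (df["forecast"] - df["observed"]) ** 2
per_stratum_rmse = df.groupby("country")["sq_err"].mean().pow(0.5)
print(per_stratum_rmse)   # per-country RMSE exposes disparities a global mean hides
```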

[349] Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Saiyong Yang, Yunfang Wu

Main category: cs.LG

TL;DR: LTE (Learning to reason from Trial and Error) is a novel RLVR approach that enhances LLM reasoning by using self-generated incorrect answers and overlong responses as hints, without requiring external expert guidance.

DetailsMotivation: Existing RLVR approaches suffer from exploration stagnation due to training only on LLMs' own responses, limiting learning from training data. External expert guidance solutions have limited availability.

Method: LTE leverages LLMs’ previously self-generated incorrect answers and overlong responses as hints during training, enabling learning from trial and error without external guidance.

Result: LTE outperforms normal group relative policy optimization (GRPO) by 6.38 in Pass@1 and 9.00 in Pass@k on average across six mathematics benchmarks for Qwen3-4B-Base.

Conclusion: LTE successfully mitigates exploration stagnation and enhances both exploitation and exploration during training, demonstrating effectiveness in improving LLM reasoning capabilities.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly boosted the reasoning capability of large language models (LLMs) recently. However, existing RLVR approaches merely train LLMs on their own generated responses and are constrained by the initial capability of LLMs, thus prone to exploration stagnation, in which LLMs fail to solve more training problems and cannot further learn from the training data. Some work tries to address this by leveraging off-policy solutions to training problems, but requires external guidance from experts, which suffers from limited availability. In this work, we propose LTE (Learning to reason from Trial and Error), an approach that hints LLMs with their previously self-generated incorrect answers and with the problem of overlong responses, and which does not require any external expert guidance. Experiments validate the effectiveness of LTE, which outperforms normal group relative policy optimization (GRPO) by 6.38 in Pass@1 and 9.00 in Pass@k on average across six mathematics benchmarks for Qwen3-4B-Base. Further analysis confirms that LTE successfully mitigates the problem of exploration stagnation and enhances both exploitation and exploration during training.
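
The hinting mechanism can be illustrated with a small helper that folds previously sampled wrong answers back into the prompt before the next rollout. The function and wording are hypothetical; in LTE such prompts feed a GRPO-style RLVR training loop.

```python
def build_hinted_prompt(problem: str, wrong_answers: list[str]) -> str:
    """Fold previously sampled wrong answers back into the prompt as hints."""
    if not wrong_answers:
        return problem
    avoid = ", ".join(wrong_answers)
    return (f"{problem}\nYour previous answers {avoid} were incorrect. "
            "Reason step by step, avoid repeating them, and keep the response concise.")

print(build_hinted_prompt("What is 17 * 24?", ["398", "418"]))
```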

[350] maxVSTAR: Maximally Adaptive Vision-Guided CSI Sensing with Closed-Loop Edge Model Adaptation for Robust Human Activity Recognition

Kexing Liu

Main category: cs.LG

TL;DR: maxVSTAR is a vision-guided framework that autonomously adapts WiFi CSI-based human activity recognition models to overcome domain shift issues on edge devices, using a cross-modal teacher-student architecture with YOLO vision supervision.

DetailsMotivation: WiFi CSI-based HAR faces severe performance degradation due to domain shift under varying environmental and hardware conditions when deployed on edge devices, requiring a solution that maintains privacy while enabling continuous adaptation.

Method: A closed-loop, vision-guided model adaptation framework with cross-modal teacher-student architecture. Uses high-accuracy YOLO-based vision model as dynamic supervisory signal to provide real-time activity labels for CSI data stream, enabling autonomous online fine-tuning of lightweight CSI-based HAR model (STAR) at the edge.

Result: Baseline STAR model accuracy dropped from 93.52% to 49.14% on uncalibrated hardware. After single vision-guided adaptation cycle, maxVSTAR restored accuracy to 81.51%, demonstrating effective dynamic adaptation capability.

Conclusion: maxVSTAR establishes a scalable paradigm for long-term autonomous HAR using CSI sensing at network edge, enabling dynamic self-supervised model adaptation in privacy-conscious IoT environments without manual intervention.

Abstract: WiFi Channel State Information (CSI)-based human activity recognition (HAR) provides a privacy-preserving, device-free sensing solution for smart environments. However, its deployment on edge devices is severely constrained by domain shift, where recognition performance deteriorates under varying environmental and hardware conditions. This study presents maxVSTAR (maximally adaptive Vision-guided Sensing Technology for Activity Recognition), a closed-loop, vision-guided model adaptation framework that autonomously mitigates domain shift for edge-deployed CSI sensing systems. The proposed system integrates a cross-modal teacher-student architecture, where a high-accuracy YOLO-based vision model serves as a dynamic supervisory signal, delivering real-time activity labels for the CSI data stream. These labels enable autonomous, online fine-tuning of a lightweight CSI-based HAR model, termed Sensing Technology for Activity Recognition (STAR), directly at the edge. This closed-loop retraining mechanism allows STAR to continuously adapt to environmental changes without manual intervention. Extensive experiments demonstrate the effectiveness of maxVSTAR. When deployed on uncalibrated hardware, the baseline STAR model’s recognition accuracy declined from 93.52% to 49.14%. Following a single vision-guided adaptation cycle, maxVSTAR restored the accuracy to 81.51%. These results confirm the system’s capacity for dynamic, self-supervised model adaptation in privacy-conscious IoT environments, establishing a scalable and practical paradigm for long-term autonomous HAR using CSI sensing at the network edge.
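
A compact sketch of the closed loop under stated assumptions: a vision teacher labels each incoming CSI window, and the lightweight CSI student is fine-tuned online on that pseudo-label. Both models below are stand-ins (the real teacher is YOLO-based and the student is the STAR model).

```python
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 7))  # 7 activities
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def vision_teacher(frame):
    # Placeholder for the YOLO-based activity recognizer on the camera frame.
    return torch.tensor([3])                    # pseudo-label for this window

for csi_window, frame in [(torch.randn(1, 128), None)] * 5:   # toy synced stream
    pseudo_label = vision_teacher(frame)
    loss = loss_fn(student(csi_window), pseudo_label)
    opt.zero_grad(); loss.backward(); opt.step()              # online fine-tuning
```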

[351] STAR: A Privacy-Preserving, Energy-Efficient Edge AI Framework for Human Activity Recognition via Wi-Fi CSI in Mobile and Pervasive Computing Environments

Kexing Liu

Main category: cs.LG

TL;DR: STAR is an edge-AI-optimized framework for Wi-Fi CSI-based human activity recognition that achieves 93.52% accuracy with a lightweight 97.6k-parameter model, enabling real-time processing on low-power embedded devices.

DetailsMotivation: Existing Wi-Fi CSI-based HAR methods face computational inefficiency, high latency, and limited feasibility in resource-constrained mobile edge environments, requiring an optimized solution for practical deployment.

Method: Integrates lightweight GRU-based neural network (33% fewer parameters than LSTM), multi-stage signal processing (median filtering, Butterworth filtering, EMD), and hardware-aware co-optimization on Rockchip RV1126 with NPU and ESP32-S3 CSI module.

Result: Achieves 93.52% mean accuracy for 7 activity classes and 99.11% for presence detection. INT8 quantization enables 33 MHz processing speed with 8% CPU utilization, 6x faster than CPU execution, with sub-second latency and low power consumption.

Conclusion: STAR provides a practical, scalable solution for real-time, privacy-preserving HAR in mobile and pervasive computing environments through edge-AI optimization and efficient on-device deployment.

Abstract: Human Activity Recognition (HAR) via Wi-Fi Channel State Information (CSI) presents a privacy-preserving, contactless sensing approach suitable for smart homes, healthcare monitoring, and mobile IoT systems. However, existing methods often encounter computational inefficiency, high latency, and limited feasibility within resource-constrained, embedded mobile edge environments. This paper proposes STAR (Sensing Technology for Activity Recognition), an edge-AI-optimized framework that integrates a lightweight neural architecture, adaptive signal processing, and hardware-aware co-optimization to enable real-time, energy-efficient HAR on low-power embedded devices. STAR incorporates a streamlined Gated Recurrent Unit (GRU)-based recurrent neural network, reducing model parameters by 33% compared to conventional LSTM models while maintaining effective temporal modeling capability. A multi-stage pre-processing pipeline combining median filtering, 8th-order Butterworth low-pass filtering, and Empirical Mode Decomposition (EMD) is employed to denoise CSI amplitude data and extract spatial-temporal features. For on-device deployment, STAR is implemented on a Rockchip RV1126 processor equipped with an embedded Neural Processing Unit (NPU), interfaced with an ESP32-S3-based CSI acquisition module. Experimental results demonstrate a mean recognition accuracy of 93.52% across seven activity classes and 99.11% for human presence detection, utilizing a compact 97.6k-parameter model. INT8 quantized inference achieves a processing speed of 33 MHz with just 8% CPU utilization, delivering sixfold speed improvements over CPU-based execution. With sub-second response latency and low power consumption, the system ensures real-time, privacy-preserving HAR, offering a practical, scalable solution for mobile and pervasive computing environments.
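
The first two denoising stages are standard signal-processing calls; a SciPy sketch follows, with the sampling rate and cutoff assumed and the EMD stage omitted (it would need an extra package such as PyEMD).

```python
import numpy as np
from scipy.signal import butter, filtfilt, medfilt

fs, cutoff = 100.0, 10.0                    # assumed sampling rate / cutoff (Hz)
csi = np.random.randn(1024)                 # stand-in CSI amplitude stream

x = medfilt(csi, kernel_size=5)             # median filter: impulse-noise removal
b, a = butter(N=8, Wn=cutoff / (fs / 2), btype="low")
x = filtfilt(b, a, x)                       # zero-phase 8th-order Butterworth low-pass
# An EMD stage (e.g., via the PyEMD package) would follow here.
```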

[352] Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment

Hyuntae Park, Yeachan Kim, SangKeun Lee

Main category: cs.LG

TL;DR: MolBridge is a novel molecule-text learning framework that uses substructure-aware alignments to capture fine-grained correspondences between molecular substructures and chemical phrases, outperforming state-of-the-art baselines.

DetailsMotivation: Existing models struggle to capture subtle differences between molecules and their descriptions due to inability to learn fine-grained alignments between molecular substructures and chemical phrases.

Method: Augments molecule-description pairs with alignment signals from molecular substructures and chemical phrases, uses substructure-aware contrastive learning with self-refinement mechanism to filter noisy alignment signals.

Result: Outperforms state-of-the-art baselines on a wide range of molecular benchmarks, effectively captures fine-grained correspondences.

Conclusion: Substructure-aware alignment is significant for molecule-text learning, and MolBridge successfully addresses the limitation of existing models.

Abstract: Molecule and text representation learning has gained increasing interest due to its potential for enhancing the understanding of chemical information. However, existing models often struggle to capture subtle differences between molecules and their descriptions, as they lack the ability to learn fine-grained alignments between molecular substructures and chemical phrases. To address this limitation, we introduce MolBridge, a novel molecule-text learning framework based on substructure-aware alignments. Specifically, we augment the original molecule-description pairs with additional alignment signals derived from molecular substructures and chemical phrases. To effectively learn from these enriched alignments, MolBridge employs substructure-aware contrastive learning, coupled with a self-refinement mechanism that filters out noisy alignment signals. Experimental results show that MolBridge effectively captures fine-grained correspondences and outperforms state-of-the-art baselines on a wide range of molecular benchmarks, highlighting the significance of substructure-aware alignment in molecule-text learning.

[353] Segmentation over Complexity: Evaluating Ensemble and Hybrid Approaches for Anomaly Detection in Industrial Time Series

Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello

Main category: cs.LG

TL;DR: Complex feature engineering and hybrid models for anomaly detection in industrial time series underperformed compared to a simple Random Forest + XGBoost ensemble, which achieved superior results with 0.976 AUC-ROC and 100% early detection.

DetailsMotivation: To investigate whether advanced feature engineering and hybrid model architectures can improve anomaly detection performance in multivariate industrial time series data from steam turbine systems.

Method: Evaluated change point-derived statistical features, clustering-based substructure representations, and hybrid learning strategies, comparing them against a simple Random Forest + XGBoost ensemble trained on segmented data.

Result: The simple ensemble significantly outperformed complex approaches, achieving AUC-ROC of 0.976, F1-score of 0.41, and 100% early detection within the defined time window.

Conclusion: In scenarios with highly imbalanced and temporally uncertain data, model simplicity combined with optimized segmentation outperforms sophisticated architectures, offering greater robustness, interpretability, and operational utility.

Abstract: In this study, we investigate the effectiveness of advanced feature engineering and hybrid model architectures for anomaly detection in a multivariate industrial time series, focusing on a steam turbine system. We evaluate the impact of change point-derived statistical features, clustering-based substructure representations, and hybrid learning strategies on detection performance. Despite their theoretical appeal, these complex approaches consistently underperformed compared to a simple Random Forest + XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of 0.976, F1-score of 0.41, and 100% early detection within the defined time window. Our findings highlight that, in scenarios with highly imbalanced and temporally uncertain data, model simplicity combined with optimized segmentation can outperform more sophisticated architectures, offering greater robustness, interpretability, and operational utility.
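
For concreteness, a minimal version of the winning baseline might look as follows, with synthetic data standing in for the segmented turbine series and xgboost assumed installed: train both models, average their predicted probabilities, and score with AUC-ROC.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))               # stand-in segmented features
y = (rng.random(1000) < 0.05).astype(int)     # highly imbalanced anomaly labels
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
xgb = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(Xtr, ytr)

# Average the two probability estimates and score the ensemble.
proba = (rf.predict_proba(Xte)[:, 1] + xgb.predict_proba(Xte)[:, 1]) / 2
print("AUC-ROC:", roc_auc_score(yte, proba))
```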

[354] A Game-Theoretic Spatio-Temporal Reinforcement Learning Framework for Collaborative Public Resource Allocation

Songxin Lei, Qiongyan Wang, Yanchen Zhu, Hanyu Yao, Sijie Ruan, Weilin Ruan, Yuyu Luo, Huaming Wu, Yuxuan Liang

Main category: cs.LG

TL;DR: Proposes Collaborative Public Resource Allocation (CPRA) with capacity constraints and spatio-temporal dynamics, solved using Game-Theoretic Spatio-Temporal Reinforcement Learning (GSTRL) framework.

DetailsMotivation: Existing public resource allocation methods optimize individual resources independently without considering capacity constraints, limiting practical applicability.

Method: Formulates CPRA as a potential game and develops GSTRL framework that captures spatio-temporal dynamics while providing theoretical foundation for Nash equilibrium approximation.

Result: GSTRL demonstrates superior performance on two real-world datasets compared to existing methods.

Conclusion: The proposed CPRA problem formulation and GSTRL framework effectively address capacity constraints and spatio-temporal dynamics in public resource allocation, with proven theoretical guarantees and practical performance.

Abstract: Public resource allocation involves the efficient distribution of resources, including urban infrastructure, energy, and transportation, to effectively meet societal demands. However, existing methods focus on optimizing the movement of individual resources independently, without considering their capacity constraints. To address this limitation, we propose a novel and more practical problem: Collaborative Public Resource Allocation (CPRA), which explicitly incorporates capacity constraints and spatio-temporal dynamics in real-world scenarios. We propose a new framework called Game-Theoretic Spatio-Temporal Reinforcement Learning (GSTRL) for solving CPRA. Our contributions are twofold: (1) we formulate the CPRA problem as a potential game and demonstrate that there is no gap between the potential function and the optimal target, laying a solid theoretical foundation for approximating the Nash equilibrium of this NP-hard problem; and (2) our designed GSTRL framework effectively captures the spatio-temporal dynamics of the overall system. We evaluate GSTRL on two real-world datasets, where experiments show its superior performance. Our source codes are available in the supplementary materials.

[355] Accumulative SGD Influence Estimation for Data Attribution

Yunxiao Shi, Shuo Yang, Yixin Su, Rui Zhang, Min Xu

Main category: cs.LG

TL;DR: ACC-SGD-IE is a trajectory-aware influence estimator that improves upon standard SGD-IE by tracking leave-one-out perturbations across training epochs, providing more accurate influence estimates especially for critical examples.

DetailsMotivation: Standard SGD-IE misranks critical examples by ignoring cross-epoch compounding effects and only summing per-epoch surrogates, leading to inaccurate per-sample influence estimates needed for modern data-centric AI.

Method: ACC-SGD-IE propagates leave-one-out perturbation across training and updates an accumulative influence state at each step, making it trajectory-aware. It works with both convex and non-convex training regimes.

Result: In smooth strongly convex settings, ACC-SGD-IE achieves geometric error contraction. In smooth non-convex regimes, it tightens error bounds. Larger mini-batches further reduce constants. Empirically, it yields more accurate influence estimates across Adult, 20 Newsgroups, and MNIST datasets under various conditions.

Conclusion: ACC-SGD-IE provides superior influence estimation for data cleansing, producing models that outperform those cleaned with standard SGD-IE, especially for identifying noisy samples over long training epochs.

Abstract: Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups, and MNIST under clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long epochs. For downstream data cleansing it more reliably flags noisy samples, producing models trained on ACC-SGD-IE cleaned data that outperform those cleaned with SGD-IE.
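
The accumulative idea can be illustrated by exactly tracking the leave-one-out weight difference through every incremental SGD step, rather than summing per-epoch surrogates. The toy logistic-regression code below is illustrative only; the paper's estimator approximates this propagation efficiently instead of re-evaluating gradients at the perturbed weights.

```python
import numpy as np

def grad(w, x, y):
    # Per-sample gradient of the logistic loss.
    p = 1 / (1 + np.exp(-x @ w))
    return (p - y) * x

def acc_influence(w0, data, target_idx, lr=0.1, epochs=3):
    """Track delta = w_without_target - w_full across the whole trajectory."""
    w, delta = w0.copy(), np.zeros_like(w0)
    for _ in range(epochs):
        for i, (x, y) in enumerate(data):
            g = grad(w, x, y)
            if i == target_idx:
                delta = delta + lr * g   # counterfactual run skips this update
            else:
                # Propagate the perturbation through the step:
                # delta <- delta - lr * (grad(w + delta) - grad(w))
                delta = delta - lr * (grad(w + delta, x, y) - g)
            w -= lr * g
    return delta                         # approx. change in the final weights

rng = np.random.default_rng(0)
data = [(rng.normal(size=3), float(rng.integers(0, 2))) for _ in range(20)]
print(acc_influence(np.zeros(3), data, target_idx=5))
```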

[356] Predicting All-Cause Hospital Readmissions from Medical Claims Data of Hospitalised Patients

Avinash Kadimisetty, Arun Rajagopalan, Vijendra SK

Main category: cs.LG

TL;DR: Machine learning models (Random Forest, Logistic Regression, SVM) were used to predict hospital readmissions from health claims data, with PCA for dimensionality reduction. Random Forest performed best.

DetailsMotivation: Reducing preventable hospital readmissions is a national priority to improve healthcare quality and lower costs, with readmission rates used as quality benchmarks.

Method: Used Logistic Regression, Random Forest, and SVM on health claims data with PCA for dimension reduction. Compared models using AUC metric.

Result: Random Forest achieved the highest performance, followed by Logistic Regression and SVM. Models identified crucial demographic and medical factors for readmission prediction.

Conclusion: These ML models can identify key factors causing readmissions and help target high-risk patients to reduce readmission rates, lowering costs and improving healthcare quality.

Abstract: Reducing preventable hospital readmissions is a national priority for payers, providers, and policymakers seeking to improve health care and lower costs. The rate of readmission is used as a benchmark to determine the quality of healthcare provided by hospitals. In this project, we used machine learning techniques, namely Logistic Regression, Random Forest, and Support Vector Machines, to analyze health claims data and identify demographic and medical factors that play a crucial role in predicting all-cause readmissions. As the health claims data is high-dimensional, we used Principal Component Analysis as a dimension reduction technique and used the results for building the models. We compared and evaluated these models based on the Area Under Curve (AUC) metric. The Random Forest model gave the highest performance, followed by the Logistic Regression and Support Vector Machine models. These models can be used to identify the crucial factors causing readmissions and to identify patients to focus on in order to reduce the chance of readmission, ultimately bringing down costs and increasing the quality of healthcare provided to patients.
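
A compact sklearn rendition of the pipeline described above, with synthetic features standing in for the high-dimensional claims data: PCA down to a few components, then the three classifiers compared by AUC.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))               # stand-in high-dimensional claims
y = (rng.random(500) < 0.2).astype(int)       # readmitted within the window
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(probability=True),
}
for name, clf in models.items():
    pipe = make_pipeline(PCA(n_components=20), clf).fit(Xtr, ytr)
    auc = roc_auc_score(yte, pipe.predict_proba(Xte)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```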

[357] Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space

Sekitoshi Kanai, Tsukasa Yoshida, Hiroshi Takahashi, Haru Kuroki, Kazumune Hashimoto

Main category: cs.LG

TL;DR: AISP is a test-time alignment method that applies Gaussian perturbation to pre-logits to maximize rewards using importance sampling, outperforming existing methods.

DetailsMotivation: Fine-tuning LLMs requires high computational costs, making test-time alignment methods more attractive for efficient model adaptation.

Method: AISP applies Gaussian perturbation to pre-logits (outputs of penultimate layer) and uses importance sampling with sampled rewards to find the optimal mean for maximizing expected rewards.

Result: AISP outperforms best-of-n sampling in reward efficiency and achieves higher rewards than other reward-based test-time alignment methods.

Conclusion: AISP provides an effective test-time alignment approach that doesn’t require expensive fine-tuning while achieving superior reward performance.

Abstract: Test-time alignment of large language models (LLMs) has attracted attention because fine-tuning LLMs incurs high computational costs. In this paper, we propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP), based on sampling-based model predictive control with a stochastic control input. AISP applies Gaussian perturbation to the pre-logits, which are the outputs of the penultimate layer, so as to maximize expected rewards with respect to the mean of the perturbation. We demonstrate that the optimal mean is obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in terms of rewards over the number of used samples and achieves higher rewards than other reward-based test-time alignment methods.
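
The core loop admits a short numpy sketch: draw Gaussian perturbations of the pre-logits, score each perturbed output with a reward model, and re-estimate the perturbation mean as a reward-weighted average. The reward function and the tiny unembedding matrix below are stand-ins, and the softmax weighting is one plausible choice of importance weights.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 8, 5
W_out = rng.normal(size=(dim, vocab))     # stand-in unembedding: pre-logit -> logits

def reward(logits):
    # Hypothetical reward model over the output distribution.
    return -np.abs(logits).sum()

pre_logit = rng.normal(size=dim)
mu, sigma, n = np.zeros(dim), 0.5, 64
for _ in range(10):                       # adaptive importance sampling loop
    eps = mu + sigma * rng.normal(size=(n, dim))        # Gaussian perturbations
    rs = np.array([reward((pre_logit + e) @ W_out) for e in eps])
    w = np.exp(rs - rs.max()); w /= w.sum()             # importance weights
    mu = w @ eps                          # reward-weighted mean of perturbations

steered_logits = (pre_logit + mu) @ W_out
```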

[358] MPRU: Modular Projection-Redistribution Unlearning as Output Filter for Classification Pipelines

Minyi Peng, Darian Gunamardi, Ivan Tjuawinata, Kwok-Yan Lam

Main category: cs.LG

TL;DR: A novel machine unlearning approach that treats classification training as a sequential process, enabling efficient knowledge removal by reversing the last training sequence using a projection-redistribution layer.

DetailsMotivation: Existing machine unlearning methods face scalability issues and require full access to original datasets and models, making them impractical for real-world deployment.

Method: Inductive approach where classes are learned sequentially, with unlearning implemented by reversing the last training sequence using a projection-redistribution layer appended to the model.

Result: Consistently similar output to fully retrained models with high computational cost reduction across multiple datasets (CIFAR-10/100 with CNN, Covertype with tree-based models).

Conclusion: The proposed solution demonstrates applicability, scalability, and system compatibility while maintaining performance in practical settings, enabling modular and model-agnostic deployment.

Abstract: As a new and promising approach, existing machine unlearning (MU) works typically emphasize theoretical formulations or optimization objectives to achieve knowledge removal. However, when deployed in real-world scenarios, such solutions typically face scalability issues and have to address practical requirements such as full access to the original datasets and model. In contrast to existing approaches, we regard classification training as a sequential process in which classes are learned one after another, which we call the "inductive approach". Unlearning can then be done by reversing the last training sequence. This is implemented by appending a projection-redistribution layer at the end of the model. Such an approach does not require full access to the original dataset or model, addressing the challenges of existing methods. This enables modular and model-agnostic deployment as an output filter into existing classification pipelines with minimal alterations. We conducted multiple experiments across several datasets, including image data (CIFAR-10/100 with a CNN-based model) and tabular data (Covertype with a tree-based model). Experimental results show consistently similar output to a fully retrained model with a large reduction in computational cost. This demonstrates the applicability, scalability, and system compatibility of our solution while maintaining output performance in a more practical setting.

[359] Angular Steering: Behavior Control via Rotation in Activation Space

Hieu M. Vu, Tan M. Nguyen

Main category: cs.LG

TL;DR: Angular Steering is a novel method for controlling specific behaviors in LLMs by rotating activations within a 2D subspace, providing fine-grained control while maintaining general capabilities.

DetailsMotivation: Current steering methods are constrained to 2D subspaces and are sensitive to parameter choices, potentially affecting unrelated features due to unintended activation space interactions.

Method: Formulates steering as geometric rotation toward/away from target behavior direction within fixed 2D subspace. Also proposes Adaptive Angular Steering that rotates only target-aligned activations.

Result: Achieves robust behavioral control while maintaining language modeling performance across multiple model families and sizes. Provides continuous, fine-grained control over behaviors like refusal and compliance.

Conclusion: Angular Steering generalizes existing techniques under unified geometric framework, simplifies parameter selection, and maintains model stability across broader adjustment ranges, offering flexibility and robustness.

Abstract: Controlling specific behaviors in large language models while preserving their general capabilities is a central challenge for safe and reliable artificial intelligence deployment. Current steering methods, such as vector addition and directional ablation, are constrained within a two-dimensional subspace defined by the activation and feature direction, making them sensitive to chosen parameters and potentially affecting unrelated features due to unintended interactions in activation space. We introduce Angular Steering, a novel and flexible method for behavior modulation that operates by rotating activations within a fixed two-dimensional subspace. By formulating steering as a geometric rotation toward or away from a target behavior direction, Angular Steering provides continuous, fine-grained control over behaviors such as refusal and compliance. We demonstrate this method using refusal steering and emotion steering as use cases. Additionally, we propose Adaptive Angular Steering, a selective variant that rotates only activations aligned with the target feature, further enhancing stability and coherence. Angular Steering generalizes existing addition and orthogonalization techniques under a unified geometric rotation framework, simplifying parameter selection and maintaining model stability across a broader range of adjustments. Experiments across multiple model families and sizes show that Angular Steering achieves robust behavioral control while maintaining general language modeling performance, underscoring its flexibility, generalization, and robustness compared to prior approaches. Code and artifacts are available at https://github.com/lone17/angular-steering/.
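
The geometric operation itself is easy to state in code: project the activation onto a fixed 2D plane spanned by orthonormal directions, rotate within that plane, and leave the orthogonal component untouched. A minimal numpy sketch follows; the directions here are random for illustration, whereas in practice one axis is the extracted behavior direction.

```python
import numpy as np

def angular_steer(h, u, v, theta):
    """Rotate activation h by theta inside span{u, v}; u, v are orthonormal."""
    a, b = h @ u, h @ v                      # coordinates in the 2D plane
    h_perp = h - a * u - b * v               # component outside the plane (kept)
    c, s = np.cos(theta), np.sin(theta)
    return h_perp + (c * a - s * b) * u + (s * a + c * b) * v

rng = np.random.default_rng(0)
u = rng.normal(size=64); u /= np.linalg.norm(u)
v = rng.normal(size=64); v -= (v @ u) * u; v /= np.linalg.norm(v)
h = rng.normal(size=64)
h_steered = angular_steer(h, u, v, theta=np.pi / 4)   # 45-degree rotation
```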

[360] Likely Interpolants of Generative Models

Frederik Möbius Rygaard, Shen Zhu, Yinzhu Jin, Søren Hauberg, Tom Fletcher

Main category: cs.LG

TL;DR: A general interpolation method for generative models that finds likely transition paths through data distributions without requiring additional training.

DetailsMotivation: Most generative models lack principled interpolation methods without restrictive assumptions, limiting controlled generation and model inspection capabilities.

Method: Developed a novel algorithm that computes interpolants analogous to geodesics constrained to data distributions, requiring no additional training and working with different metrics and probability distributions.

Result: The method locally behaves as geodesics under Riemannian metrics and quantitatively traverses higher density regions than baseline methods across various models and datasets.

Conclusion: The proposed interpolation scheme provides a principled approach for generative model interpolation that better follows data distributions without restrictive assumptions.

Abstract: Interpolation in generative models allows for controlled generation, model inspection, and more. Unfortunately, most generative models lack a principled notion of interpolants without restrictive assumptions on either the model or data dimension. In this paper, we develop a general interpolation scheme that targets likely transition paths compatible with different metrics and probability distributions. We consider interpolants analogous to a geodesic constrained to a suitable data distribution and derive a novel algorithm for computing these curves, which requires no additional training. Theoretically, we show that our method can locally be considered a geodesic under a suitable Riemannian metric. We quantitatively show that our interpolation scheme traverses higher-density regions than baselines across a range of models and datasets.

[361] Distributional Multi-objective Black-box Optimization for Diffusion-model Inference-time Multi-Target Generation

Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong

Main category: cs.LG

TL;DR: IMG algorithm optimizes diffusion process at inference-time for multi-objective optimization using weighted resampling to generate samples satisfying multiple objectives simultaneously.

DetailsMotivation: Existing approaches treat diffusion models as black-box refiners and overlook internal distribution transitions, limiting efficiency in multi-objective optimization.

Method: Performs weighted resampling during diffusion generation according to expected aggregated multi-objective values to ensure samples follow desired multi-target Boltzmann distribution.

Result: IMG achieves significantly higher hypervolume than baseline algorithms with only single generation pass, outperforming methods requiring hundreds of diffusion generations.

Conclusion: IMG provides efficient inference-time optimization for diffusion models in multi-objective tasks and can be integrated into existing methods to improve performance.

Abstract: Diffusion models have been successful in learning complex data distributions. This capability has driven their application to high-dimensional multi-objective black-box optimization problems. Existing approaches often apply an external optimization loop, such as an evolutionary algorithm, to the diffusion model. However, these approaches treat the diffusion model as a black-box refiner, which overlooks the internal distribution transition of the diffusion generation process, limiting their efficiency. To address these challenges, we propose the Inference-time Multi-target Generation (IMG) algorithm, which optimizes the diffusion process at inference-time to generate samples that simultaneously satisfy multiple objectives. Specifically, our IMG performs weighted resampling during the diffusion generation process according to the expected aggregated multi-objective values. This weighted resampling strategy ensures the diffusion-generated samples are distributed according to our desired multi-target Boltzmann distribution. We further derive that the multi-target Boltzmann distribution has an interesting log-likelihood interpretation, where it is the optimal solution to the distributional multi-objective optimization problem. We implemented IMG for a multi-objective molecule generation task. Experiments show that IMG, requiring only a single generation pass, achieves a significantly higher hypervolume than baseline optimization algorithms that often require hundreds of diffusion generations. Notably, our algorithm can be viewed as an optimized diffusion process and can be integrated into existing methods to further improve their performance.
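
The weighted-resampling step can be sketched as follows: at each denoising step, particles are resampled with Boltzmann weights on the aggregated objective values, nudging generation toward the multi-target distribution. The objectives, temperature, and stand-in "denoise" update below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def objectives(x):
    # Two toy objectives pulling toward +1 and -1; real tasks score molecules.
    return np.stack([-np.abs(x - 1.0), -np.abs(x + 1.0)], axis=-1)

particles = rng.normal(size=128)
for t in range(50):                                   # toy reverse-diffusion loop
    agg = objectives(particles).sum(axis=-1)          # aggregated objective values
    w = np.exp((agg - agg.max()) / 0.5)               # Boltzmann weights (temp 0.5)
    w /= w.sum()
    particles = particles[rng.choice(len(particles), size=len(particles), p=w)]
    particles += 0.05 * rng.normal(size=particles.shape)   # stand-in denoising step
```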

[362] Empirical Bayesian Multi-Bandit Learning

Xia Jiang, Rong J. B. Zhu

Main category: cs.LG

TL;DR: A hierarchical Bayesian framework for multi-task contextual bandits that captures task heterogeneity and correlations through empirical Bayesian estimation of covariance structure, with two efficient algorithms (ebmTS and ebmUCB) showing superior performance.

DetailsMotivation: To enhance decision-making across multiple related bandit tasks by leveraging shared structures while accommodating task-specific heterogeneity, addressing limitations of previous methods that overlook covariance structure learning.

Method: Proposed hierarchical Bayesian model with empirical Bayesian approach to estimate prior covariance matrix, developed ebmTS (Thompson Sampling) and ebmUCB (Upper Confidence Bound) algorithms that incorporate the estimated prior.

Result: Algorithms achieve lower cumulative regret than existing methods on synthetic and real-world datasets, particularly in complex environments, with theoretical frequentist regret upper bounds provided.

Conclusion: The hierarchical Bayesian framework effectively balances exploration and exploitation across multi-bandits, demonstrating superior performance through both theoretical guarantees and empirical validation.

Abstract: Multi-task learning in contextual bandits has attracted significant research interest due to its potential to enhance decision-making across multiple related tasks by leveraging shared structures and task-specific heterogeneity. In this article, we propose a novel hierarchical Bayesian framework for learning in various bandit instances. This framework captures both the heterogeneity and the correlations among different bandit instances through a hierarchical Bayesian model, enabling effective information sharing while accommodating instance-specific variations. Unlike previous methods that overlook the learning of the covariance structure across bandits, we introduce an empirical Bayesian approach to estimate the covariance matrix of the prior distribution. This enhances both the practicality and flexibility of learning across multi-bandits. Building on this approach, we develop two efficient algorithms: ebmTS (Empirical Bayesian Multi-Bandit Thompson Sampling) and ebmUCB (Empirical Bayesian Multi-Bandit Upper Confidence Bound), both of which incorporate the estimated prior into the decision-making process. We provide the frequentist regret upper bounds for the proposed algorithms, thereby filling a research gap in the field of multi-bandit problems. Extensive experiments on both synthetic and real-world datasets demonstrate the superior performance of our algorithms, particularly in complex environments. Our methods achieve lower cumulative regret compared to existing techniques, highlighting their effectiveness in balancing exploration and exploitation across multi-bandits.

[363] Offline Clustering of Preference Learning with Active-data Augmentation

Jingyuan Liu, Fatemeh Ghaffari, Xuchuang Wang, Mohammad Hajiesmaili, Carlee Joe-Wong

Main category: cs.LG

TL;DR: The paper proposes offline clustering methods for preference learning from pairwise feedback, addressing challenges of user heterogeneity and imbalanced data through two algorithms: Off-C²PL for pure offline settings and A²-Off-C²PL for active data augmentation.

DetailsMotivation: Real-world preference learning involves users with different preferences, creating challenges in aggregating imbalanced offline data across different preference dimensions and effectively utilizing limited user interactions.

Method: Two algorithms: Off-C²PL for pure offline learning with theoretical suboptimality bounds, and A²-Off-C²PL that extends to active data augmentation by selecting samples targeting the least-informative dimensions of test user preferences.

Result: Theoretical analysis shows tradeoff between sample noise and bias, and proves that actively collected samples are more effective than offline ones. Simulations on synthetic and real-world datasets validate the theoretical results.

Conclusion: The proposed offline clustering framework effectively handles user heterogeneity and data imbalance in preference learning, with active data augmentation providing significant benefits for underrepresented preference dimensions.

Abstract: Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of imbalanced data, we extend our framework to the setting with active-data augmentation, where the learner is allowed to select a limited number of additional active-data for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user’s preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.

[364] Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens

Ziliang Chen, Tianang Xiao, Jusheng Zhang, Yongsen Zheng, Xipeng Chen

Main category: cs.LG

TL;DR: CLIP fails at compositional reasoning despite strong cross-modal alignment. A token-aware causal framework reveals composition nonidentifiability as the root cause, explaining why CLIP behaves like a bag-of-words matcher and struggles with hard negatives.

DetailsMotivation: Previous causal accounts of CLIP's compositional failures modeled text as single vectors, obscuring token-level structure and leaving phenomena like prompt sensitivity and hard negative failures unexplained.

Method: Proposed a token-aware causal representation learning framework using sequential, language-token structural causal models. Extended block identifiability to tokenized text and analyzed composition nonidentifiability.

Result: Proved that pseudo-optimal text encoders can achieve perfect modal-invariant alignment yet remain insensitive to compositional operations (SWAP, REPLACE, ADD), explaining CLIP’s brittleness. Linked language-side nonidentifiability to visual failures via modality gap.

Conclusion: Composition nonidentifiability is the principled explanation for CLIP’s compositional brittleness. The analysis motivates improved negative mining strategies to address iterated composition hardness.

Abstract: Contrastive Language-Image Pre-training (CLIP) delivers strong cross-modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations, often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena, such as prompt sensitivity and failures on hard negatives, unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token SCM. Our theory extends block identifiability to tokenized text, proving that CLIP’s contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP’s compositional brittleness: composition nonidentifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side nonidentifiability to visual-side failures via the modality gap and shows how iterated composition operators compound hardness, motivating improved negative mining strategies.

[365] Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

Beomhan Baek, Minhak Song, Chulhee Yun

Main category: cs.LG

TL;DR: Incremental Adam’s implicit bias differs from full-batch Adam - it can converge to ℓ₂-max-margin classifier instead of ℓ∞-max-margin, depending on dataset structure and batching scheme.

DetailsMotivation: Adam is widely used but its theoretical understanding is limited, especially regarding how batch size affects its implicit bias in optimization.

Method: Analyzed incremental Adam (one sample per step) for logistic regression on separable data, developed proxy algorithm for β₂→1 limit, and compared with Signum optimizer.

Result: Incremental Adam converges to ℓ₂-max-margin classifier on structured datasets, while full-batch Adam favors ℓ∞-max-margin. Signum consistently converges to ℓ∞-max-margin regardless of batch size.

Conclusion: Adam’s implicit bias depends on both batching scheme and dataset structure, while Signum remains invariant to batch size, highlighting important practical implications for optimizer selection.

Abstract: Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
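
For reference, the two margin notions contrasted above can be written in the standard form (notation assumed): the max-margin direction under a given norm maximizes the worst-case margin over that norm's unit ball.

```latex
% l2- vs l_infty-max-margin classifiers on separable data {(x_i, y_i)}:
w_{\ell_2} = \arg\max_{\|w\|_2 \le 1} \min_i \, y_i \langle w, x_i \rangle,
\qquad
w_{\ell_\infty} = \arg\max_{\|w\|_\infty \le 1} \min_i \, y_i \langle w, x_i \rangle.
```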

[366] Model Inversion with Layer-Specific Modeling and Alignment for Data-Free Continual Learning

Ruilin Tong, Haodong Lu, Yuhang Liu, Dong Gong

Main category: cs.LG

TL;DR: The paper proposes Per-layer Model Inversion (PMI) and feature modeling for data-free continual learning, enabling efficient generation of synthetic data from pre-trained models without storing real samples.

DetailsMotivation: Data-free continual learning is needed when storing and replaying real data is infeasible due to privacy/security constraints or impractical for arbitrary pre-trained models. Existing model inversion methods face challenges with feature drift and computational expense.

Method: Proposes PMI for efficient initialization of full-model inversion, reducing iterations. Uses Gaussian distributions and contrastive modeling to align synthetic and real features. Generates pseudo-images from semantic-aware projected features for continual learning of new classes.

Result: The approach achieves strong effectiveness and compatibility across multiple continual learning settings, substantially reducing computational iterations while maintaining feature alignment.

Conclusion: Combining PMI with feature modeling enables efficient and effective data-free continual learning, addressing key challenges of feature drift and computational expense in model inversion for large pre-trained models like CLIP.

Abstract: Continual learning (CL) aims to incrementally train a model on a sequence of tasks while retaining performance on prior ones. However, storing and replaying data is often infeasible due to privacy or security constraints and impractical for arbitrary pre-trained models. Data-free CL seeks to update models without access to previous data. Beyond regularization, we employ model inversion to synthesize data from the trained model, enabling replay without storing samples. Yet, model inversion in predictive models faces two challenges: (1) generating inputs solely from compressed output labels causes drift between synthetic and real data, and replaying such data can erode prior knowledge; (2) inversion is computationally expensive since each step backpropagates through the full model. These issues are amplified in large pre-trained models such as CLIP. To improve efficiency, we propose Per-layer Model Inversion (PMI), inspired by faster convergence in single-layer optimization. PMI provides strong initialization for full-model inversion, substantially reducing iterations. To mitigate feature shift, we model class-wise features via Gaussian distributions and contrastive modeling, ensuring alignment between synthetic and real features. Combining PMI and feature modeling, our approach enables continual learning of new classes by generating pseudo-images from semantic-aware projected features, achieving strong effectiveness and compatibility across multiple CL settings.

[367] On the Impact of Weight Discretization in QUBO-Based SVM Training

Sascha Mücke

Main category: cs.LG

TL;DR: Quantum annealing for SVM training via QUBO formulation shows competitive performance even with low-precision encoding (1 bit per parameter), suggesting support vector selection matters more than precise weighting.

DetailsMotivation: To study how qubit count (discretization level of dual weights) affects SVM predictive performance and compare QUBO-based SVM training with classical LIBSVM solver.

Method: Formulate SVM training as QUBO problem and use quantum annealing for optimization, varying the number of qubits (bit-depth) for parameter discretization.

Result: Low-precision QUBO encodings (e.g., 1 bit per parameter) yield competitive and sometimes superior accuracy compared to LIBSVM. Increased bit-depth enables larger regularization but doesn’t always improve classification.

Conclusion: Support vector selection may be more important than precise weighting. Quantum annealing shows potential for efficient SVM training as quantum devices scale, despite current hardware limitations on QUBO size.

Abstract: Training Support Vector Machines (SVMs) can be formulated as a QUBO problem, enabling the use of quantum annealing for model optimization. In this work, we study how the number of qubits - linked to the discretization level of dual weights - affects predictive performance across datasets. We compare QUBO-based SVM training to the classical LIBSVM solver and find that even low-precision QUBO encodings (e.g., 1 bit per parameter) yield competitive, and sometimes superior, accuracy. While increased bit-depth enables larger regularization parameters, it does not always improve classification. Our findings suggest that selecting the right support vectors may matter more than their precise weighting. Although current hardware limits the size of solvable QUBOs, our results highlight the potential of quantum annealing for efficient SVM training as quantum devices scale.
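
Under a 1-bit encoding alpha_i = C * b_i with b_i in {0, 1}, the negated SVM dual becomes a QUBO over b. The sketch below builds that matrix, folding the equality constraint sum_i alpha_i y_i = 0 in as a quadratic penalty (one common choice; exact formulations vary), and brute-forces the toy instance where an annealer would be used at scale.

```python
import numpy as np

def svm_qubo(K, y, C=1.0, penalty=1.0):
    """QUBO matrix for the negated SVM dual with 1-bit weights alpha_i = C*b_i."""
    n = len(y)
    Q = 0.5 * C**2 * np.outer(y, y) * K      # quadratic part of the dual
    Q -= np.eye(n) * C                       # linear part: -sum_i alpha_i
    Q += penalty * C**2 * np.outer(y, y)     # penalty for (sum_i alpha_i y_i)^2
    return Q                                 # minimize b^T Q b over b in {0,1}^n

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
Q = svm_qubo(X @ X.T, y)                     # linear kernel K = X X^T

# Brute-force the 2^10 assignments; a quantum annealer replaces this at scale.
bits = ((np.arange(2**10)[:, None] >> np.arange(10)) & 1).astype(float)
best = bits[np.argmin(np.einsum("bi,ij,bj->b", bits, Q, bits))]
print("selected support vectors:", np.flatnonzero(best))
```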

[368] Posterior Sampling by Combining Diffusion Models with Annealed Langevin Dynamics

Zhiyang Xun, Shivam Gupta, Eric Price

Main category: cs.LG

TL;DR: This paper presents a method for conditional sampling from posterior distributions in noisy linear inverse problems by combining diffusion models with annealed Langevin dynamics, achieving polynomial-time sampling with only L^4 bounds on score estimation error.

DetailsMotivation: Posterior sampling is crucial for tasks like inpainting, deblurring, and MRI reconstruction, but approximate posterior sampling is computationally intractable in general. Existing methods like Langevin dynamics are brittle to score estimation errors.

Method: The authors combine diffusion models with an annealed variant of Langevin dynamics to achieve conditional sampling from log-concave distributions in noisy linear measurement settings.

Result: The proposed method achieves polynomial-time conditional sampling using only an L^4 bound on the score estimation error, which is less restrictive than the MGF/sub-exponential bounds required by standard Langevin dynamics.

Conclusion: By merging diffusion models with annealed Langevin dynamics, the paper provides an efficient and robust framework for posterior sampling in inverse problems, overcoming the brittleness of previous methods to score estimation errors.

Abstract: Given a noisy linear measurement $y = Ax + \xi$ of a distribution $p(x)$, and a good approximation to the prior $p(x)$, when can we sample from the posterior $p(x \mid y)$? Posterior sampling provides an accurate and fair framework for tasks such as inpainting, deblurring, and MRI reconstruction, and several heuristics attempt to approximate it. Unfortunately, approximate posterior sampling is computationally intractable in general. To sidestep this hardness, we focus on (local or global) log-concave distributions $p(x)$. In this regime, Langevin dynamics yields posterior samples when the exact scores of $p(x)$ are available, but it is brittle to score–estimation error, requiring an MGF bound (sub-exponential error). By contrast, in the unconditional setting, diffusion models succeed with only an $L^2$ bound on the score error. We prove that combining diffusion models with an annealed variant of Langevin dynamics achieves conditional sampling in polynomial time using merely an $L^4$ bound on the score error.
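
A minimal numpy sketch of the sampler the abstract outlines, assuming access to an approximate prior score `score(x, sigma)`; the exact Gaussian likelihood gradient is added to it, and the annealing schedule and step sizes here are illustrative rather than the paper's.

```python
import numpy as np

def annealed_langevin_posterior(score, A, y, sigma_y, x0, sigmas,
                                steps=50, step_size=1e-3, rng=None):
    """Annealed Langevin dynamics targeting p(x | y) for y = A x + noise.

    `score(x, sigma)` approximates the score of the smoothed prior;
    the Gaussian likelihood gradient A^T (y - A x) / sigma_y^2 is exact.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for sigma in sigmas:                 # anneal from large to small noise
        eta = step_size * (sigma / sigmas[-1]) ** 2
        for _ in range(steps):
            grad = score(x, sigma) + A.T @ (y - A @ x) / sigma_y**2
            x = x + eta * grad + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
    return x
```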

[369] Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections

David Schmotz, Sahar Abdelnabi, Maksym Andriushchenko

Main category: cs.LG

TL;DR: Agent Skills framework in LLMs enables continual learning but is fundamentally insecure due to trivial prompt injections that can exfiltrate sensitive data and bypass system guardrails.

DetailsMotivation: To analyze the security vulnerabilities in Agent Skills framework for LLMs, which was introduced as a solution for continual learning but creates serious security risks.

Method: Demonstrated how to hide malicious instructions in long Agent Skill files and referenced scripts, and showed how to bypass system-level guardrails using benign task approvals with the “Don’t ask again” option.

Result: Successfully exfiltrated sensitive data like internal files and passwords, and bypassed security guardrails of a popular coding agent, showing that closely related harmful actions can inherit approval from benign tasks.

Conclusion: Despite ongoing research and scaling model capabilities, frontier LLMs remain vulnerable to simple prompt injections in realistic scenarios, highlighting fundamental security flaws in the Agent Skills framework.

Abstract: Enabling continual learning in LLMs remains a key unresolved research challenge. In a recent announcement, a frontier LLM company made a step towards this by introducing Agent Skills, a framework that equips agents with new knowledge based on instructions stored in simple markdown files. Although Agent Skills can be a very useful tool, we show that they are fundamentally insecure, since they enable trivially simple prompt injections. We demonstrate how to hide malicious instructions in long Agent Skill files and referenced scripts to exfiltrate sensitive data, such as internal files or passwords. Importantly, we show how to bypass system-level guardrails of a popular coding agent: a benign, task-specific approval with the “Don’t ask again” option can carry over to closely related but harmful actions. Overall, we conclude that despite ongoing research efforts and scaling model capabilities, frontier LLMs remain vulnerable to very simple prompt injections in realistic scenarios. Our code is available at https://github.com/aisa-group/promptinject-agent-skills.

[370] Linear Causal Discovery with Interventional Constraints

Zhigao Guo, Feng Dong

Main category: cs.LG

TL;DR: Introduces interventional constraints as a novel concept in causal discovery that encodes high-level causal knowledge through inequality constraints on causal effects, bridging the gap between structural constraints and actual causal influences.

DetailsMotivation: Existing causal discovery methods can enforce structural constraints but may still produce incorrect causal conclusions (e.g., learning opposite causal effects). Incorporating causal knowledge and mechanisms is essential for refining causal models and improving downstream tasks like treatment design.

Method: Proposes a metric to quantify total causal effects for linear causal models and formulates the problem as a constrained optimization task, solved using a two-stage constrained optimization method.

Result: Evaluation on real-world datasets shows that integrating interventional constraints improves model accuracy, ensures consistency with established findings, makes models more explainable, and facilitates discovery of new causal relationships that would be costly to identify otherwise.

Conclusion: Interventional constraints provide an effective framework for incorporating causal knowledge into causal discovery, bridging the gap between structural constraints and actual causal effects, leading to more accurate and explainable causal models.

Abstract: Incorporating causal knowledge and mechanisms is essential for refining causal models and improving downstream tasks such as designing new treatments. In this paper, we introduce a novel concept in causal discovery, termed interventional constraints, which differs fundamentally from interventional data. While interventional data require direct perturbations of variables, interventional constraints encode high-level causal knowledge in the form of inequality constraints on causal effects. For instance, in the Sachs dataset (Sachs et al.\ 2005), Akt has been shown to be activated by PIP3, meaning PIP3 exerts a positive causal effect on Akt. Existing causal discovery methods allow enforcing structural constraints (for example, requiring a causal path from PIP3 to Akt), but they may still produce incorrect causal conclusions such as learning that “PIP3 inhibits Akt”. Interventional constraints bridge this gap by explicitly constraining the total causal effect between variable pairs, ensuring learned models respect known causal influences. To formalize interventional constraints, we propose a metric to quantify total causal effects for linear causal models and formulate the problem as a constrained optimization task, solved using a two-stage constrained optimization method. We evaluate our approach on real-world datasets and demonstrate that integrating interventional constraints not only improves model accuracy and ensures consistency with established findings, making models more explainable, but also facilitates the discovery of new causal relationships that would otherwise be costly to identify.
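
For linear models, the total causal effect has a closed form as a sum over directed paths, which makes the inequality constraints easy to state. The sketch below computes it and checks constraints of the kind "PIP3 exerts a positive effect on Akt" from the Sachs example; the constraint encoding and helper names are illustrative.

```python
import numpy as np

def total_causal_effects(W):
    """Total effects in a linear SCM with weighted adjacency W, where
    W[i, j] is the direct effect of variable i on j. Summing over all
    directed paths gives T = W + W^2 + ... = (I - W)^{-1} - I,
    a finite series for DAGs."""
    n = W.shape[0]
    return np.linalg.inv(np.eye(n) - W) - np.eye(n)

def violated_constraints(W, constraints):
    """constraints: list of (i, j, lower_bound), meaning the total effect
    of i on j must exceed lower_bound (e.g. 0 for 'positive')."""
    T = total_causal_effects(W)
    return [(i, j, T[i, j]) for i, j, lb in constraints if T[i, j] <= lb]

# Example: PIP3 (index 0) -> Akt (index 1) must have a positive total effect.
W = np.zeros((3, 3)); W[0, 1] = 0.8
print(violated_constraints(W, [(0, 1, 0.0)]))   # [] -> constraint satisfied
```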

[371] Reinforcement Learning for Pollution Detection in a Randomized, Sparse and Nonstationary Environment with an Autonomous Underwater Vehicle

Sebastian Zieglmeier, Niklas Erdmann, Narada D. Warakagoda

Main category: cs.LG

TL;DR: Modified Monte Carlo RL approach outperforms traditional Q-learning and exhaustive search in sparse, random, nonstationary environments like underwater pollution detection with AUVs.

DetailsMotivation: RL algorithms struggle in random, nonstationary environments with sparse rewards, particularly in applications like underwater pollution cloud detection where actions often yield zero rewards.

Method: Systematic modifications to classical RL including hierarchical algorithm changes, multigoal learning, and integration of location memory as external filter to prevent state revisits.

Result: Modified Monte Carlo-based approach significantly outperforms traditional Q-learning and two exhaustive search patterns in sparse, randomized environments.

Conclusion: RL approaches can be effectively adapted for use in random, nonstationary, and reward-sparse environments through systematic modifications to classical methods.

Abstract: Reinforcement learning (RL) algorithms are designed to optimize problem-solving by learning actions that maximize rewards, a task that becomes particularly challenging in random and nonstationary environments. Even advanced RL algorithms are often limited in their ability to solve problems in these conditions. In applications such as searching for underwater pollution clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate reward-sparse environments, where actions frequently result in a zero reward. This paper aims to address these challenges by revisiting and modifying classical RL approaches to efficiently operate in sparse, randomized, and nonstationary environments. We systematically study a large number of modifications, including hierarchical algorithm changes, multigoal learning, and the integration of a location memory as an external output filter to prevent state revisits. Our results demonstrate that a modified Monte Carlo-based approach significantly outperforms traditional Q-learning and two exhaustive search patterns, illustrating its potential in adapting RL to complex environments. These findings suggest that reinforcement learning approaches can be effectively adapted for use in random, nonstationary, and reward-sparse environments.
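
The location-memory idea, used as an external filter on action selection, can be sketched as follows. The epsilon-greedy wrapper and the `transition` interface are assumptions; the paper's hierarchical and multigoal modifications are not shown.

```python
import random

def act_with_location_memory(q_values, state, visited, actions, transition,
                             epsilon=0.1):
    """Epsilon-greedy action selection with a location memory as an
    external output filter: actions leading to already-visited cells are
    masked out whenever an unvisited alternative exists."""
    fresh = [a for a in actions if transition(state, a) not in visited]
    candidates = fresh if fresh else list(actions)
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda a: q_values.get((state, a), 0.0))
```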

[372] UnifiedFL: A Dynamic Unified Learning Framework for Equitable Federation

Furkan Pala, Islem Rekik

Main category: cs.LG

TL;DR: UnifiedFL is a dynamic federated learning framework that addresses architectural and statistical heterogeneity across clients by representing different neural networks as nodes in a graph optimized by a shared GNN, with clustering and two-tier aggregation for improved performance.

DetailsMotivation: Current FL methods fail to support fundamentally different neural architectures across clients (e.g., CNNs, GNNs, MLPs) and overlook domain-fracture problems where client data distributions differ from test domains, limiting model generalizability.

Method: Represents heterogeneous local networks as nodes/edges in a directed model graph optimized by a shared GNN, uses Euclidean distance-based clustering for clients, and implements a two-tier aggregation policy balancing convergence and diversity.

Result: Experiments on MedMNIST classification and hippocampus segmentation benchmarks demonstrate UnifiedFL’s superior performance compared to existing methods.

Conclusion: UnifiedFL successfully addresses architectural heterogeneity and domain-fracture problems in federated learning, enabling effective collaboration across clients with fundamentally different neural architectures and non-identically distributed datasets.

Abstract: Federated learning (FL) has emerged as a key paradigm for collaborative model training across multiple clients without sharing raw data, enabling privacy-preserving applications in areas such as radiology and pathology. However, works on collaborative training across clients with fundamentally different neural architectures and non-identically distributed datasets remain scarce. Existing FL frameworks face several limitations. Despite claiming to support architectural heterogeneity, most recent FL methods only tolerate variants within a single model family (e.g., shallower, deeper, or wider CNNs), still presuming a shared global architecture and failing to accommodate federations where clients deploy fundamentally different network types (e.g., CNNs, GNNs, MLPs). Moreover, existing approaches often address only statistical heterogeneity while overlooking the domain-fracture problem, where each client’s data distribution differs markedly from that faced at testing time, undermining model generalizability. When clients use different architectures, have non-identically distributed data, and encounter distinct test domains, current methods perform poorly. To address these challenges, we propose UnifiedFL, a dynamic federated learning framework that represents heterogeneous local networks as nodes and edges in a directed model graph optimized by a shared graph neural network (GNN). UnifiedFL introduces (i) a common GNN to parameterize all architectures, (ii) distance-driven clustering via Euclidean distances between clients’ parameters, and (iii) a two-tier aggregation policy balancing convergence and diversity. Experiments on MedMNIST classification and hippocampus segmentation benchmarks demonstrate UnifiedFL’s superior performance. Code and data: https://github.com/basiralab/UnifiedFL
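
The distance-driven clustering ingredient can be sketched in a few lines, assuming the client parameter vectors already share a dimension (in UnifiedFL this is provided by the common GNN parameterization, which is omitted here):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_clients(param_vectors, n_clusters=3):
    """Group clients by Euclidean distance between their flattened
    parameter vectors, one ingredient of the two-tier aggregation.
    Sketch only: heterogeneous architectures must first be mapped to a
    shared parameter space."""
    X = np.stack([v.ravel() for v in param_vectors])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```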

[373] Towards Explainable and Reliable AI in Finance

Albi Isufaj, Pablo Mollá, Helmut Prendinger

Main category: cs.LG

TL;DR: The paper presents an explainable AI framework for financial forecasting that combines time series foundation models with reliability estimation and symbolic reasoning to ensure only reliable and explainable forecasts are executed.

DetailsMotivation: Address the opacity challenges of large neural network models in financial forecasting, which raises issues for trust and regulatory compliance.

Method: Three approaches: 1) Time-LLM foundation model with prompts to avoid wrong directional forecasts, 2) Combining foundation models with reliability estimators to filter unreliable predictions, 3) Symbolic reasoning encoding domain rules for transparent justification.

Result: Experiments on equity and cryptocurrency data show reduced false positives and support for selective execution of forecasts.

Conclusion: The framework advances transparent and auditable financial AI systems by integrating predictive performance with reliability estimation and rule-based reasoning.

Abstract: Financial forecasting increasingly uses large neural network models, but their opacity raises challenges for trust and regulatory compliance. We present several approaches to explainable and reliable AI in finance. First, we describe how Time-LLM, a time series foundation model, uses a prompt to avoid a wrong directional forecast. Second, we show that combining foundation models for time series forecasting with a reliability estimator can filter out unreliable predictions. Third, we argue for symbolic reasoning encoding domain rules for transparent justification. These approaches shift the emphasis to executing only forecasts that are both reliable and explainable. Experiments on equity and cryptocurrency data show that the architecture reduces false positives and supports selective execution. By integrating predictive performance with reliability estimation and rule-based reasoning, our framework advances transparent and auditable financial AI systems.
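
The selective-execution idea reduces to a simple gate. The sketch below uses assumed `forecast_fn`/`reliability_fn` interfaces rather than the paper's actual components:

```python
def selective_execute(forecast_fn, reliability_fn, x, threshold=0.8):
    """Execute a directional forecast only when it is deemed reliable.
    `forecast_fn` returns a direction, `reliability_fn` a score in [0, 1];
    both interfaces are assumptions for illustration."""
    direction = forecast_fn(x)
    reliability = reliability_fn(x, direction)
    if reliability < threshold:
        return None          # abstain: filter out unreliable predictions
    return direction
```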

[374] CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse

Kazuma Kano, Yuki Mori, Shin Katayama, Kenta Urano, Takuro Yonezawa, Nobuo Kawaguchi

Main category: cs.LG

TL;DR: CorVS is a novel person identification method that matches visual tracking trajectories with wearable sensor measurements using deep learning to predict correspondence probabilities and reliability, enabling robust worker identification in industrial settings.

DetailsMotivation: Worker location data is crucial for productivity in industrial sites. While cameras offer valuable environmental context, visual-only identification is impractical. Existing methods combining trajectories and sensors can fail under real-world conditions.

Method: 1) Deep learning model predicts correspondence probabilities and reliabilities for each trajectory-sensor pair 2) Algorithm matches trajectories and sensor measurements over time using predicted probabilities and reliabilities

Result: The method was demonstrated to be effective for real-world applications using a dataset collected from actual warehouse operations.

Conclusion: CorVS provides a robust data-driven solution for person identification in industrial environments by effectively combining visual tracking and sensor measurements.

Abstract: Worker location data is key to higher productivity in industrial sites. Cameras are a promising tool for localization in logistics warehouses since they also offer valuable environmental contexts such as package status. However, identifying individuals with only visual data is often impractical. Accordingly, several prior studies identified people in videos by comparing their trajectories and wearable sensor measurements. While this approach has advantages such as independence from appearance, the existing methods may break down under real-world conditions. To overcome this challenge, we propose CorVS, a novel data-driven person identification method based on correspondence between visual tracking trajectories and sensor measurements. Firstly, our deep learning model predicts correspondence probabilities and reliabilities for every pair of a trajectory and sensor measurements. Secondly, our algorithm matches the trajectories and sensor measurements over time using the predicted probabilities and reliabilities. We developed a dataset with actual warehouse operations and demonstrated the method’s effectiveness for real-world applications.
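
The matching step can be sketched as a reliability-weighted one-to-one assignment. Accumulating evidence over time, which the method also does, is omitted, and the scoring rule is an assumption:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_trajectories(prob, reliability):
    """Match visual trajectories to sensor streams.

    prob[i, j]: predicted probability that trajectory i belongs to the
    person wearing sensor j; reliability[i, j]: confidence in that
    prediction. Reliability-weighted log-likelihoods feed a one-to-one
    assignment (Hungarian algorithm)."""
    score = reliability * np.log(np.clip(prob, 1e-9, 1.0))
    row, col = linear_sum_assignment(-score)   # maximize total score
    return list(zip(row, col))
```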

[375] Efficient Generative AI Boosts Probabilistic Forecasting of Sudden Stratospheric Warmings

Ningning Tao, Fei Xie, Baoxiang Pan, Hongyu Wang, Han Huang, Zhongpu Qiu, Ke Gui, Jiali Luo, Xiaosong Chen

Main category: cs.LG

TL;DR: FM-Cast is a generative AI model using Flow Matching for efficient probabilistic forecasting of Sudden Stratospheric Warmings (SSWs), achieving comparable performance to leading NWP systems with much faster computation.

DetailsMotivation: SSWs are crucial for subseasonal predictability but remain challenging to forecast accurately due to limitations in physical representation, initialization, and computational demands of ensemble forecasts in traditional NWP systems.

Method: Developed a Flow Matching-based generative AI model (FM-Cast) for probabilistic forecasting of stratospheric circulation evolution, evaluated across 18 major SSW events from 1998-2024.

Result: FM-Cast skillfully forecasted onset, intensity, and morphology of 10 SSW events up to 20 days in advance with ensemble accuracies above 50%. It performed comparably or better than leading NWP systems while requiring only 2 minutes for 50-member, 30-day forecasts on consumer GPU.

Conclusion: The work establishes a computationally efficient paradigm for probabilistic forecasting of stratospheric anomalies and demonstrates generative AI’s potential to enhance physical understanding of atmosphere-climate dynamics, particularly distinguishing between troposphere-forced and stratosphere-driven SSW events.

Abstract: Sudden Stratospheric Warmings (SSWs) are key sources of subseasonal predictability and major drivers of extreme winter weather. Yet, their accurate and efficient forecast remains a persistent challenge for numerical weather prediction (NWP) systems due to limitations in physical representation, initialization, and the immense computational demands of ensemble forecasts. While data-driven forecasting is rapidly evolving, its application to the complex, three-dimensional dynamics of SSWs, particularly for probabilistic forecasting, remains underexplored. Here, we bridge this gap by developing a Flow Matching-based generative AI model (FM-Cast) for efficient and skillful probabilistic forecasting of the spatiotemporal evolution of stratospheric circulation. Evaluated across 18 major SSW events (1998-2024), FM-Cast skillfully forecasts the onset, intensity, and morphology of 10 events up to 20 days in advance, achieving ensemble accuracies above 50%. Its performance is comparable to or exceeds leading NWP systems while requiring only two minutes for a 50-member, 30-day forecast on a consumer GPU. Furthermore, leveraging FM-Cast as a scientific tool, we demonstrate through idealized experiments that SSW predictability is fundamentally linked to its underlying physical drivers, distinguishing between events forced from the troposphere and those driven by internal stratospheric dynamics. Our work thus establishes a computationally efficient paradigm for probabilistic forecasting of stratospheric anomalies and showcases generative AI’s potential to deepen the physical understanding of atmosphere-climate dynamics.
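
For readers unfamiliar with Flow Matching, the core training objective looks roughly as follows. This is a generic sketch with a linear probability path; FM-Cast additionally conditions on past atmospheric states, which is omitted here.

```python
import torch

def flow_matching_loss(model, x1, t=None):
    """Conditional flow-matching objective: with x0 ~ N(0, I) and a
    linear path x_t = (1 - t) * x0 + t * x1, the regression target is
    the constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)                       # noise sample
    if t is None:                                   # one time per sample
        t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1), device=x1.device)
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(xt, t)                           # assumed interface
    return torch.mean((v_pred - v_target) ** 2)
```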

[376] Multi-Task Learning Based on Support Vector Machines and Twin Support Vector Machines: A Comprehensive Survey

Fatemeh Bazikar, Hossein Moosaei, Atefeh Hemmati, Panos M. Pardalos

Main category: cs.LG

TL;DR: This chapter surveys multi-task learning approaches using Support Vector Machines (SVMs) and Twin SVMs (TWSVMs), comparing their theoretical properties, optimization strategies, and empirical performance across various applications.

DetailsMotivation: While deep learning dominates recent MTL research, SVMs and TWSVMs remain relevant due to their interpretability, theoretical rigor, and effectiveness with small datasets, especially in data-scarce or high-dimensional scenarios.

Method: The survey examines MTL approaches based on SVM and TWSVM, focusing on shared representations, task regularization, and structural coupling strategies. It also explores emerging TWSVM extensions for multi-task settings.

Result: The chapter provides a comprehensive comparison of these models in terms of theoretical properties, optimization strategies, and empirical performance across applications in computer vision, natural language processing, and bioinformatics.

Conclusion: The work identifies research gaps and outlines future directions for building scalable, interpretable, and reliable margin-based MTL frameworks, serving as a comprehensive resource for researchers and practitioners.

Abstract: Multi-task learning (MTL) enables simultaneous training across related tasks, leveraging shared information to improve generalization, efficiency, and robustness, especially in data-scarce or high-dimensional scenarios. While deep learning dominates recent MTL research, Support Vector Machines (SVMs) and Twin SVMs (TWSVMs) remain relevant due to their interpretability, theoretical rigor, and effectiveness with small datasets. This chapter surveys MTL approaches based on SVM and TWSVM, highlighting shared representations, task regularization, and structural coupling strategies. Special attention is given to emerging TWSVM extensions for multi-task settings, which show promise but remain underexplored. We compare these models in terms of theoretical properties, optimization strategies, and empirical performance, and discuss applications in fields such as computer vision, natural language processing, and bioinformatics. Finally, we identify research gaps and outline future directions for building scalable, interpretable, and reliable margin-based MTL frameworks. This work provides a comprehensive resource for researchers and practitioners interested in SVM- and TWSVM-based multi-task learning.

[377] Personalized Treatment Outcome Prediction from Scarce Data via Dual-Channel Knowledge Distillation and Adaptive Fusion

Wenjie Chen, Li Zhuang, Ziying Luo, Yu Liu, Jiahao Wu, Shengcai Liu

Main category: cs.LG

TL;DR: CFKD-AFN uses low-fidelity simulation data to enhance treatment outcome predictions for rare patient groups when trial data is scarce, achieving significant accuracy improvements.

DetailsMotivation: Personalized treatment prediction for small-sample and rare patient groups is crucial in precision medicine, but costly trial data limits prediction performance.

Method: Cross-fidelity knowledge distillation and adaptive fusion network with dual-channel knowledge distillation from low-fidelity models and attention-guided fusion for multi-source information integration.

Result: Significant improvements over state-of-the-art methods (6.67% to 74.55% accuracy gains) on COPD treatment outcome prediction, with strong robustness to varying high-fidelity dataset sizes.

Conclusion: CFKD-AFN effectively leverages simulation data to enhance trial-based predictions and can be extended to an interpretable variant for clinical decision support.

Abstract: Personalized treatment outcome prediction based on trial data for small-sample and rare patient groups is critical in precision medicine. However, the costly trial data limit the prediction performance. To address this issue, we propose a cross-fidelity knowledge distillation and adaptive fusion network (CFKD-AFN), which leverages abundant but low-fidelity simulation data to enhance predictions on scarce but high-fidelity trial data. CFKD-AFN incorporates a dual-channel knowledge distillation module to extract complementary knowledge from the low-fidelity model, along with an attention-guided fusion module to dynamically integrate multi-source information. Experiments on treatment outcome prediction for chronic obstructive pulmonary disease demonstrate significant improvements of CFKD-AFN over state-of-the-art methods in prediction accuracy, ranging from 6.67% to 74.55%, and strong robustness to varying high-fidelity dataset sizes. Furthermore, we extend CFKD-AFN to an interpretable variant, enabling the exploration of latent medical semantics to support clinical decision-making.
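
One plausible reading of the dual-channel distillation module is a logit channel plus a feature channel distilled from the low-fidelity teacher, added to the supervised loss on the scarce trial data. The channel split and the loss weights below are assumptions, not the paper's definition:

```python
import torch.nn.functional as F

def dual_channel_kd_loss(s_logits, s_feat, t_logits, t_feat, y,
                         T=2.0, a=0.5, b=0.5):
    """Supervised loss on high-fidelity labels plus two distillation
    channels from the low-fidelity teacher: softened logits (KL) and
    intermediate features (MSE). Weights a, b are illustrative."""
    kd_logit = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                        F.softmax(t_logits / T, dim=-1),
                        reduction="batchmean") * T * T
    kd_feat = F.mse_loss(s_feat, t_feat)
    return F.cross_entropy(s_logits, y) + a * kd_logit + b * kd_feat
```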

[378] Co-Evolving Latent Action World Models

Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang, Jiang Bian

Main category: cs.LG

TL;DR: CoLA-World enables joint training of latent action models and pre-trained world models through a warm-up phase, overcoming representational collapse and achieving superior video simulation and visual planning performance.

DetailsMotivation: Current two-stage approaches for adapting video generation models into controllable world models have redundant training and limited co-adaptation potential between latent action models and world models.

Method: Proposes CoLA-World with a critical warm-up phase that aligns representations between from-scratch latent action models and pre-trained world models, enabling joint training and co-evolution.

Result: Matches or outperforms prior two-stage methods in video simulation quality and downstream visual planning tasks.

Conclusion: Establishes a robust and efficient new paradigm for creating controllable world models through synergistic joint training of latent action models and world models.

Abstract: Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamic model in LAM with a powerful world model and train them jointly, but it is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.

[379] Deep sequence models tend to memorize geometrically; it is unclear why

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

Main category: cs.LG

TL;DR: The paper contrasts associative vs geometric views of parametric memory in Transformers, showing models synthesize global geometric relationships between entities rather than just storing local co-occurrences, simplifying complex reasoning tasks.

DetailsMotivation: To challenge the predominant associative view of memory as brute-force lookup of co-occurrences and demonstrate that Transformers actually develop geometric representations encoding global relationships between entities.

Method: Isolated a clean Transformer reasoning instance incompatible with associative memory, analyzed neural embedding geometries, connected findings to Node2Vec, and identified spectral bias as the source of geometric representations.

Result: Found that Transformers naturally develop elegant geometric representations of atomic facts that encode global relationships, simplifying complex reasoning tasks into 1-step geometric operations, despite optimizing only over local associations.

Conclusion: The geometric view of parametric memory should replace default associative intuitions, revealing headroom for making Transformer memory more geometric and encouraging revisiting approaches in knowledge acquisition, capacity, discovery and unlearning.

Abstract: In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a storage of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimizational pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that – in contrast to prevailing theories – indeed arises naturally despite the lack of various pressures. This analysis also points to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery and unlearning.

[380] Robust Graph Condensation via Classification Complexity Mitigation

Jiayi Luo, Qingyun Sun, Beining Yang, Haonan Yuan, Xingcheng Fu, Yanbiao Ma, Jianxin Li, Philip S. Yu

Main category: cs.LG

TL;DR: The paper proposes MRGC, a manifold-constrained robust graph condensation framework that addresses the vulnerability of existing graph condensation methods to adversarial attacks by preserving classification complexity reduction while ensuring robustness.

DetailsMotivation: Existing graph condensation methods perform poorly when the original graph is corrupted, and current robust graph learning techniques offer limited effectiveness. The authors found that graph condensation inherently reduces intrinsic dimensions but is highly vulnerable to adversarial perturbations.

Method: Proposed MRGC framework with three graph data manifold learning modules that constrain the condensed graph to lie within a smooth, low-dimensional manifold with minimal class ambiguity, preserving classification complexity reduction capability.

Result: Extensive experiments demonstrate that MRGC achieves robust performance across diverse attack scenarios, significantly outperforming existing methods when graphs are corrupted.

Conclusion: The manifold-constrained approach effectively addresses the robustness vulnerability in graph condensation while maintaining its core benefit of classification complexity reduction, making MRGC a reliable framework for robust graph condensation.

Abstract: Graph condensation (GC) has gained significant attention for its ability to synthesize smaller yet informative graphs. However, existing studies often overlook the robustness of GC in scenarios where the original graph is corrupted. In such cases, we observe that the performance of GC deteriorates significantly, while existing robust graph learning technologies offer only limited effectiveness. Through both empirical investigation and theoretical analysis, we reveal that GC is inherently an intrinsic-dimension-reducing process, synthesizing a condensed graph with lower classification complexity. Although this property is critical for effective GC performance, it remains highly vulnerable to adversarial perturbations. To tackle this vulnerability and improve GC robustness, we adopt the geometry perspective of graph data manifold and propose a novel Manifold-constrained Robust Graph Condensation framework named MRGC. Specifically, we introduce three graph data manifold learning modules that guide the condensed graph to lie within a smooth, low-dimensional manifold with minimal class ambiguity, thereby preserving the classification complexity reduction capability of GC and ensuring robust performance under universal adversarial attacks. Extensive experiments demonstrate the robustness of MRGC across diverse attack scenarios.

[381] ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems

Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, Tianwei Zhang

Main category: cs.LG

TL;DR: ReSpec is a system that adapts speculative decoding for reinforcement learning training of LLMs, achieving up to 4.5x speedup while maintaining training stability and reward convergence.

DetailsMotivation: RL training of LLMs is bottlenecked by generation time (over 75% of training time), and speculative decoding - while effective for serving - has unexplored behavior under RL training with issues like diminishing speedups, drafter staleness, and policy degradation.

Method: ReSpec adapts speculative decoding to RL through three mechanisms: dynamic SD configuration tuning, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards.

Result: On Qwen models (3B-14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability.

Conclusion: ReSpec provides a practical solution for efficient RL-based LLM adaptation by successfully adapting speculative decoding to the RL training context.

Abstract: Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B–14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
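
The reward-weighted drafter update can be sketched as a distillation loss whose per-rollout KL terms are weighted by normalized rollout rewards, so high-reward rollouts shape the drafter most. The normalization and temperature below are assumptions:

```python
import torch
import torch.nn.functional as F

def drafter_update_loss(drafter_logits, actor_logits, rewards, T=1.0):
    """Distill the speculative-decoding drafter toward the current actor,
    weighting each rollout's KL term by its (softmax-normalized) reward.
    Shapes: logits (B, L, V), rewards (B,)."""
    kl = F.kl_div(F.log_softmax(drafter_logits / T, dim=-1),
                  F.softmax(actor_logits / T, dim=-1),
                  reduction="none").sum(-1)        # (B, L) per-token KL
    w = torch.softmax(rewards, dim=0)              # (B,) rollout weights
    return (w * kl.mean(dim=-1)).sum()
```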

[382] Remote Labor Index: Measuring AI Automation of Remote Work

Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Sumana Basu, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernandez Cardona, Annette Diamond, Summer Yue, Alexandr Wang, Bing Liu, Ernesto Hernandez, Dan Hendrycks

Main category: cs.LG

TL;DR: AI agents perform poorly on real-world economic tasks, achieving only 2.5% automation rate on the Remote Labor Index benchmark.

DetailsMotivation: To measure how AI research gains translate into actual economic value and automation in practical settings, bridging the gap between research benchmarks and real-world applications.

Method: Introduced the Remote Labor Index (RLI) - a multi-sector benchmark comprising real-world, economically valuable projects to evaluate end-to-end agent performance.

Result: AI agents performed near the floor on RLI, with the highest-performing agent achieving only 2.5% automation rate.

Conclusion: The results provide empirical evidence to ground discussions of AI automation, setting a baseline for tracking AI impacts and helping stakeholders navigate AI-driven labor automation.

Abstract: AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.

[383] Quantum Gated Recurrent GAN with Gaussian Uncertainty for Network Anomaly Detection

Wajdi Hammami, Soumaya Cherkaoui, Jean-Frederic Laprade, Ola Ahmad, Shengrui Wang

Main category: cs.LG

TL;DR: A novel Quantum Gated Recurrent Unit-based Generative Adversarial Network with Successive Data Injection and multi-metric gating for robust network anomaly detection in time-series data, achieving 89.43% TaF1 score on quantum hardware.

DetailsMotivation: Anomaly detection in time-series data is critical for network security, and quantum machine learning approaches show promise but are constrained by limited qubit counts.

Method: Uses quantum-enhanced generator that outputs Gaussian distribution parameters via reparameterization, combined with Wasserstein critic for stable training. Employs novel gating mechanism using Gaussian uncertainty estimates, critic scores, and reconstruction errors for anomaly detection.

Result: Achieved 89.43% time-series aware F1 score (TaF1), outperforming existing classical and quantum models. Successfully deployed on real IBM Quantum hardware with retained high performance.

Conclusion: The QGRU-WGAN demonstrates superior anomaly detection capability and practical feasibility on current NISQ devices, confirming robustness and real-world applicability.

Abstract: Anomaly detection in time-series data is a critical challenge with significant implications for network security. Recent quantum machine learning approaches, such as quantum kernel methods and variational quantum circuits, have shown promise in capturing complex data distributions for anomaly detection but remain constrained by limited qubit counts. We introduce in this work a novel Quantum Gated Recurrent Unit (QGRU)-based Generative Adversarial Network (GAN) employing Successive Data Injection (SuDaI) and a multi-metric gating strategy for robust network anomaly detection. Our model uniquely utilizes a quantum-enhanced generator that outputs parameters (mean and log-variance) of a Gaussian distribution via reparameterization, combined with a Wasserstein critic to stabilize adversarial training. Anomalies are identified through a novel gating mechanism that initially flags potential anomalies based on Gaussian uncertainty estimates and subsequently verifies them using a composite of critic scores and reconstruction errors. Evaluated on benchmark datasets, our method achieves a high time-series aware F1 score (TaF1) of 89.43% demonstrating superior capability in detecting anomalies accurately and promptly as compared to existing classical and quantum models. Furthermore, the trained QGRU-WGAN was deployed on real IBM Quantum hardware, where it retained high anomaly detection performance, confirming its robustness and practical feasibility on current noisy intermediate-scale quantum (NISQ) devices.
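
The two-stage gating can be sketched as follows, with illustrative thresholds in place of the paper's calibrated values:

```python
import numpy as np

def gate_anomalies(mu, log_var, x, critic_score,
                   z_thresh=3.0, v_thresh=0.0, r_thresh=0.1):
    """Stage 1: flag points whose standardized deviation from the
    generator's Gaussian output (mu, log_var) is large.
    Stage 2: confirm flags with a composite of critic score and
    reconstruction error. Shapes: mu, log_var, x are (B, D);
    critic_score is (B,)."""
    sigma = np.exp(0.5 * log_var)
    z = np.abs(x - mu) / sigma                    # uncertainty-based flag
    flagged = z.max(axis=-1) > z_thresh
    recon_err = np.mean((x - mu) ** 2, axis=-1)   # reconstruction check
    return flagged & ((critic_score < v_thresh) | (recon_err > r_thresh))
```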

[384] Defeating the Training-Inference Mismatch via FP16

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Main category: cs.LG

TL;DR: Using FP16 instead of BF16 in RL fine-tuning of LLMs eliminates numerical mismatch between training and inference, leading to more stable optimization, faster convergence, and better performance.

DetailsMotivation: RL fine-tuning of LLMs suffers from instability due to numerical mismatch between training and inference policies, which prior work tried to address through algorithmic corrections or engineering alignments.

Method: Simply reverting from BF16 to FP16 precision, which is fully supported by modern frameworks with minimal code changes and requires no modifications to model architecture or learning algorithm.

Result: FP16 yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks compared to BF16.

Conclusion: The findings suggest reconsidering precision trade-offs in RL fine-tuning, as FP16 effectively eliminates the numerical mismatch issue that BF16 introduces through large rounding errors.

Abstract: Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
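
The mantissa gap behind this finding is easy to see directly: BF16 keeps FP32's 8 exponent bits but only 7 mantissa bits, while FP16 keeps 10 mantissa bits at the cost of dynamic range. A two-line demo:

```python
import torch

# 1 + 2^-10 is exactly representable in FP16 (10 mantissa bits) but
# rounds away in BF16 (7 mantissa bits, spacing ~2^-7 near 1.0).
x = torch.tensor(1.0009765625)             # exactly 1 + 2^-10 in FP32
print(x.to(torch.bfloat16).item())         # 1.0 -- rounded away
print(x.to(torch.float16).item())          # 1.0009765625 -- preserved
```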

[385] Data-Efficient RLVR via Off-Policy Influence Guidance

Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, Hongning Wang

Main category: cs.LG

TL;DR: Proposes CROPI, a curriculum RL framework using off-policy influence estimation for efficient data selection in RLVR, achieving 2.66x acceleration with only 10% data per stage.

DetailsMotivation: Current data selection methods for RLVR are heuristic-based and lack theoretical guarantees, limiting their effectiveness and generalizability.

Method: Uses influence functions to estimate data contribution, off-policy estimation with pre-collected trajectories to avoid costly rollouts, and sparse random projection for gradient dimensionality reduction.

Result: CROPI achieves 2.66x step-level acceleration on 1.5B model using only 10% of data per stage compared to full-dataset training, with experiments up to 7B parameters.

Conclusion: Influence-based data selection shows substantial potential for efficient RLVR training.

Abstract: Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
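
The projection-plus-dot-product pipeline can be sketched as below. The sparse Rademacher-style construction and the first-order (curvature-free) influence approximation are standard choices, though the paper's exact estimator may differ:

```python
import numpy as np

def project_gradients(grads, dim=1024, seed=0, density=0.01):
    """Compress per-example gradients (N, d) with a sparse random
    projection: most entries of R are zero, nonzeros are +/-1 scaled to
    preserve inner products in expectation."""
    rng = np.random.default_rng(seed)
    d = grads.shape[1]
    mask = rng.random((d, dim)) < density
    signs = rng.choice([-1.0, 1.0], size=(d, dim))
    R = mask * signs / np.sqrt(density * dim)
    return grads @ R

def influence_scores(train_g, target_g):
    """First-order influence approximation: dot product between each
    candidate example's projected gradient and the target objective's
    projected gradient; curvature terms are omitted in this sketch."""
    return train_g @ target_g
```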

[386] Multiclass Local Calibration With the Jensen-Shannon Distance

Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana

Main category: cs.LG

TL;DR: The paper introduces multiclass local calibration to address proximity bias in ML models, proposes a method using Jensen-Shannon distance, and validates it against existing techniques.

DetailsMotivation: Existing multiclass calibration methods lack distance awareness, making them vulnerable to proximity bias where predictions in sparse feature regions are systematically miscalibrated, which is critical in high-stakes settings like healthcare.

Method: Proposes a practical method for enhancing local calibration in Neural Networks by enforcing alignment between predicted probabilities and local estimates of class frequencies using Jensen-Shannon distance.

Result: The approach is empirically validated against existing multiclass calibration techniques, though specific performance metrics are not detailed in the abstract.

Conclusion: Multiclass local calibration addresses the proximity bias limitation of existing methods and provides a more robust calibration framework, especially for high-stakes applications.

Abstract: Developing trustworthy Machine Learning (ML) models requires their predicted probabilities to be well-calibrated, meaning they should reflect true-class frequencies. Among calibration notions in multiclass classification, strong calibration is the most stringent, as it requires all predicted probabilities to be simultaneously calibrated across all classes. However, existing approaches to multiclass calibration lack a notion of distance among inputs, which makes them vulnerable to proximity bias: predictions in sparse regions of the feature space are systematically miscalibrated. This is especially relevant in high-stakes settings, such as healthcare, where the sparse instances are exactly those most at risk of biased treatment. In this work, we address this main shortcoming by introducing a local perspective on multiclass calibration. First, we formally define multiclass local calibration and establish its relationship with strong calibration. Second, we theoretically analyze the pitfalls of existing evaluation metrics when applied to multiclass local calibration. Third, we propose a practical method for enhancing local calibration in Neural Networks, which enforces alignment between predicted probabilities and local estimates of class frequencies using the Jensen-Shannon distance. Finally, we empirically validate our approach against existing multiclass calibration techniques.
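
The proposed penalty can be sketched as the Jensen-Shannon distance between each predicted distribution and the empirical class frequencies of its k nearest neighbours in feature space; the neighbour indices are assumed precomputed (e.g. via torch.cdist plus topk):

```python
import torch
import torch.nn.functional as F

def js_local_calibration_loss(probs, neighbor_idx, labels, n_classes):
    """probs: (B, C) predicted distributions; neighbor_idx: (B, k)
    indices of each input's nearest neighbours among N labelled points;
    labels: (N,). Penalizes the JS distance to local class frequencies."""
    onehot = F.one_hot(labels, n_classes).float()          # (N, C)
    local_freq = onehot[neighbor_idx].mean(dim=1)          # (B, C)
    m = 0.5 * (probs + local_freq)
    kl = lambda p, q: (p * (p.clamp_min(1e-9) / q.clamp_min(1e-9)).log()).sum(-1)
    js_div = 0.5 * kl(probs, m) + 0.5 * kl(local_freq, m)  # JS divergence
    return js_div.clamp_min(1e-12).sqrt().mean()           # distance = sqrt(divergence)
```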

[387] Enhancing ECG Classification Robustness with Lightweight Unsupervised Anomaly Detection Filters

Mustafa Fuad Rifet Ibrahim, Maurice Meijer, Alexander Schlaefer, Peer Stelldinger

Main category: cs.LG

TL;DR: This paper proposes using optimized Unsupervised Anomaly Detection (UAD) as an upstream filter to improve reliability of ECG analysis on wearables by detecting Out-of-Distribution data like unseen pathologies and noisy signals.

DetailsMotivation: Deep learning models for ECG analysis on resource-constrained wearables face reliability issues when encountering OOD data, which can cause erroneous high-confidence predictions and compromise patient safety. Existing methods don't adequately address computational constraints or handle noise and unseen classes together.

Method: Benchmarked six UAD approaches (Deep SVDD, reconstruction-based models, Masked Anomaly Detection, normalizing flows, and diffusion models) optimized via Neural Architecture Search under strict resource constraints (≤512k parameters). Evaluated on PTB-XL and BUT QDB datasets for detecting OOD CVD classes and noisy signals.

Result: Deep SVDD consistently achieved the best trade-off between detection performance and efficiency. In deployment simulation, integrating optimized Deep SVDD filter with diagnostic classifier improved accuracy by up to 21 percentage points over classifier-only baseline.

Conclusion: Optimized UAD filters can effectively safeguard automated ECG analysis, enabling safer and more reliable continuous cardiovascular monitoring on wearable devices.

Abstract: Continuous electrocardiogram (ECG) monitoring via wearables offers significant potential for early cardiovascular disease (CVD) detection. However, deploying deep learning models for automated analysis in resource-constrained environments faces reliability challenges due to inevitable Out-of-Distribution (OOD) data. OOD inputs, such as unseen pathologies or noise-corrupted signals, often cause erroneous, high-confidence predictions by standard classifiers, compromising patient safety. Existing OOD detection methods either neglect computational constraints or address noise and unseen classes separately. This paper explores Unsupervised Anomaly Detection (UAD) as an independent, upstream filtering mechanism to improve robustness. We benchmark six UAD approaches, including Deep SVDD, reconstruction-based models, Masked Anomaly Detection, normalizing flows, and diffusion models, optimized via Neural Architecture Search (NAS) under strict resource constraints (at most 512k parameters). Evaluation on PTB-XL and BUT QDB datasets assessed detection of OOD CVD classes and signals unsuitable for analysis due to noise. Results show Deep SVDD consistently achieves the best trade-off between detection and efficiency. In a realistic deployment simulation, integrating the optimized Deep SVDD filter with a diagnostic classifier improved accuracy by up to 21 percentage points over a classifier-only baseline. This study demonstrates that optimized UAD filters can safeguard automated ECG analysis, enabling safer, more reliable continuous cardiovascular monitoring on wearables.
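
As an upstream filter, Deep SVDD amounts to thresholding the squared distance of an embedding to a fixed center. A minimal sketch, assuming a trained encoder and a calibrated threshold:

```python
import torch

class DeepSVDDFilter:
    """Upstream OOD filter: anomaly score = squared distance of the
    encoder's embedding to a fixed center c (the Deep SVDD objective).
    Only samples scoring below the threshold reach the diagnostic
    classifier; the rest are routed away instead of receiving a
    high-confidence diagnosis."""
    def __init__(self, encoder, center, threshold):
        self.encoder, self.c, self.tau = encoder, center, threshold

    def __call__(self, x):
        with torch.no_grad():
            score = ((self.encoder(x) - self.c) ** 2).sum(dim=-1)
        return score <= self.tau     # True -> in-distribution, safe to classify
```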

[388] Aeolus: A Multi-structural Flight Delay Dataset

Lin Xu, Xinyun Yuan, Yuxuan Liang, Suwan Yin, Yuankai Wu

Main category: cs.LG

TL;DR: Aeolus is a large-scale multi-modal flight delay dataset with tabular data, flight chains, and flight network graphs to address limitations of existing datasets and support flight delay prediction research.

DetailsMotivation: Existing flight delay datasets are limited to flat tabular structures and fail to capture spatiotemporal dynamics of delay propagation, creating a need for more comprehensive datasets.

Method: Created a multi-modal dataset with three aligned components: tabular data with operational/meteorological features, flight chains modeling sequential delay propagation, and flight network graphs encoding resource connections.

Result: Aeolus provides over 50 million flights with comprehensive features, temporal splits, and leakage prevention to support realistic ML evaluation across regression, classification, temporal modeling, and graph learning tasks.

Conclusion: Aeolus fills a key gap for both domain-specific flight delay modeling and general-purpose structured data research, serving as a unified benchmark across tabular, sequential, and graph modalities.

Abstract: We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airport-level features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research. Our source code and data can be accessed at https://github.com/Flnny/Delay-data

[389] LLMs as In-Context Meta-Learners for Model and Hyperparameter Selection

Youssef Attia El Hili, Albert Thomas, Malik Tiomoko, Abdelhakim Benechehab, Corentin Léger, Corinne Ancourt, Balázs Kégl

Main category: cs.LG

TL;DR: LLMs can act as in-context meta-learners for model and hyperparameter selection by using dataset metadata, with both zero-shot and meta-informed prompting strategies showing competitive performance without expensive search.

DetailsMotivation: Model and hyperparameter selection typically requires expert intuition or expensive automated search, which is challenging and resource-intensive.

Method: Convert datasets into interpretable metadata and prompt LLMs using two strategies: zero-shot mode (relying on pretrained knowledge) and meta-informed mode (augmented with examples of models and their past performance).

Result: LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, with meta-informed prompting showing improvements that demonstrate in-context meta-learning capability.

Conclusion: LLMs show promise as lightweight, general-purpose assistants for model selection and hyperparameter optimization through in-context meta-learning.

Abstract: Model and hyperparameter selection are critical but challenging in machine learning, typically requiring expert intuition or expensive automated search. We investigate whether large language models (LLMs) can act as in-context meta-learners for this task. By converting each dataset into interpretable metadata, we prompt an LLM to recommend both model families and hyperparameters. We study two prompting strategies: (1) a zero-shot mode relying solely on pretrained knowledge, and (2) a meta-informed mode augmented with examples of models and their performance on past tasks. Across synthetic and real-world benchmarks, we show that LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, and that improvements from meta-informed prompting demonstrate their capacity for in-context meta-learning. These results highlight a promising new role for LLMs as lightweight, general-purpose assistants for model selection and hyperparameter optimization.
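
The two prompting modes differ only in whether past-task examples are appended to the dataset metadata. A sketch with illustrative field names (not the paper's schema):

```python
def build_prompt(meta, history=None):
    """Zero-shot vs meta-informed prompting for model selection.
    `meta` is a dict of dataset statistics; `history` holds
    (metadata, model, score) tuples from past tasks."""
    lines = ["Dataset metadata:"]
    lines += [f"- {k}: {v}" for k, v in meta.items()]
    if history:                       # meta-informed mode
        lines.append("Past tasks and outcomes:")
        lines += [f"- {m} -> {model} (score {s:.3f})" for m, model, s in history]
    lines.append("Recommend a model family and hyperparameters.")
    return "\n".join(lines)

prompt = build_prompt(
    {"n_samples": 5000, "n_features": 20, "task": "binary classification"},
    history=[({"n_samples": 3000}, "GradientBoosting", 0.91)],
)
```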

[390] Think Outside the Policy: In-Context Steered Policy Optimization

Hsiu-Yuan Huang, Chenming Tang, Weijie Liu, Saiyong Yang, Yunfang Wu

Main category: cs.LG

TL;DR: ICPO is a new RLVR framework that uses in-context learning to provide expert guidance without needing advanced model trajectories, improving exploration and training stability for large reasoning models.

DetailsMotivation: Existing RLVR methods like GRPO have limited exploration due to on-policy rollouts confined to current policy distribution, while approaches using expert models increase computational costs and require inaccessible advanced models.

Method: ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing to expand exploration, Expert Region Reject Sampling to filter unreliable trajectories, and Annealed Expert-Bonus Reward Shaping to balance expert guidance with autonomous improvement.

Result: ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks.

Conclusion: ICPO reveals a scalable and effective RLVR paradigm for large reasoning models that doesn’t require advanced model trajectories.

Abstract: Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to reliance on on-policy rollouts that are confined to the current policy’s distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates Expert Region Reject Sampling to filter unreliable off-policy trajectories and Annealed Expert-Bonus Reward Shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs.
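
The group-relative advantage at the heart of GRPO-style methods is easy to state in code. The toy sketch below mixes on-policy rollouts with expert-guided ones in a single group, an illustrative reading of Mixed-Policy GRPO rather than ICPO's exact procedure; the reward values and mixing ratio are assumptions.

```python
# Toy sketch: group-relative advantages over a mixed rollout group.
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO normalizes rewards within each group of rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rng = np.random.default_rng(0)
on_policy_rewards = rng.binomial(1, 0.3, size=6).astype(float)  # current policy
expert_rewards    = rng.binomial(1, 0.8, size=2).astype(float)  # in-context-steered
group = np.concatenate([on_policy_rewards, expert_rewards])

adv = group_relative_advantages(group)
# Expert-guided successes get positive advantage, implicitly "forcing" the
# policy toward expert-like behavior without an external expert model.
print(adv)
```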

[391] Clone Deterministic 3D Worlds with Geometrically-Regularized World Models

Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen

Main category: cs.LG

TL;DR: GRWM improves world models by enforcing geometric regularization in latent space, leading to better representations and more stable long-horizon predictions without enlarging dynamics modules.

DetailsMotivation: Current world models degrade over long horizons due to poor representation quality from high-dimensional, lossy, or entangled latent representations that make dynamics learning difficult.

Method: Proposes Geometrically-Regularized World Models (GRWM) which enforces that consecutive points along sensory trajectories remain close in latent space, learning representations that align with the true environment topology.

Result: GRWM significantly improves rollout fidelity and stability across deterministic 3D settings and long-horizon prediction tasks, learning latent manifolds with superior geometric structure.

Conclusion: Improving representation learning through geometric regularization is a direct and effective path to building robust world models that deliver reliable long-horizon predictions.

Abstract: A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
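
A minimal sketch of the geometric regularizer described above: penalize latent distances between temporally consecutive observations. The trajectory below is a toy assumption; in GRWM this term is added on top of the latent generative backbone's own training loss.

```python
# Minimal sketch, assuming latents arrive in temporal order.
import numpy as np

def grwm_penalty(latents: np.ndarray) -> float:
    """latents: (T, d) codes of one sensory trajectory, in temporal order."""
    diffs = latents[1:] - latents[:-1]          # consecutive pairs only
    return float((diffs ** 2).sum(axis=1).mean())

z = np.cumsum(np.random.default_rng(0).normal(scale=0.1, size=(100, 8)), axis=0)
print(grwm_penalty(z))  # small when the trajectory moves smoothly in latent space
```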

[392] On the limitation of evaluating machine unlearning using only a single training seed

Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma

Main category: cs.LG

TL;DR: Machine unlearning algorithms are sensitive to random training seeds, making single-model comparisons unreliable; empirical evaluations should account for variability across different training seeds.

DetailsMotivation: To address the unreliability of current empirical comparisons in machine unlearning, which often use multiple runs from the same trained model but ignore the sensitivity to random training seeds.

Method: Demonstrate through analysis that machine unlearning methods are highly sensitive to the random number seed used during model training, even for the same architecture and dataset.

Result: The study shows that common practices in machine unlearning evaluation can produce non-representative results due to seed sensitivity, highlighting the need for more robust comparison methods.

Conclusion: Empirical comparisons of machine unlearning algorithms should incorporate variability across different model training seeds to ensure representative and reliable performance assessments.

Abstract: Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because – even for the same architecture and same dataset – some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
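
A minimal sketch of the recommended protocol, with `train_model`, `unlearn`, and `metric` as hypothetical placeholders: retrain from scratch under several seeds and report the spread of the unlearning metric, rather than a point estimate from one trained model.

```python
import numpy as np

def evaluate_across_training_seeds(train_model, unlearn, metric,
                                   forget_set, seeds=(0, 1, 2, 3, 4)):
    """Report mean and spread of an MU metric over independent training seeds."""
    scores = []
    for seed in seeds:
        base_model = train_model(seed)              # retrain from scratch per seed
        scrubbed = unlearn(base_model, forget_set)  # apply the MU algorithm
        scores.append(metric(scrubbed, forget_set))
    scores = np.asarray(scores)
    return scores.mean(), scores.std()              # spread, not a point estimate

# Toy stand-ins, just to make the protocol concrete:
mean, std = evaluate_across_training_seeds(
    train_model=lambda seed: np.random.default_rng(seed).normal(size=10),
    unlearn=lambda model, forget: model * 0.9,
    metric=lambda model, forget: float(np.linalg.norm(model)),
    forget_set=None,
)
print(f"{mean:.3f} +/- {std:.3f}")
```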

[393] Polybasic Speculative Decoding Through a Theoretical Perspective

Ruilin Wang, Huixia Li, Yuexiao Ma, Xiawu Zheng, Fei Chao, Xuefeng Xiao, Rongrong Ji

Main category: cs.LG

TL;DR: The paper introduces a polybasic speculative decoding framework that accelerates LLM inference by extending beyond dualistic draft-verify approaches, achieving 3.31× to 4.43× speedup while preserving output distribution.

DetailsMotivation: Inference latency is a critical bottleneck in large-scale LLM deployment, and existing speculative decoding methods lack rigorous theoretical grounding and are limited to dualistic frameworks.

Method: Proposes a polybasic speculative decoding framework with comprehensive theoretical analysis, characterizing optimal inference time for multi-model systems and optimizing the interplay between model capabilities, acceptance lengths, and computational cost.

Result: Achieves speedup ratios of 3.31× to 4.01× for LLaMA2-Chat 7B, up to 3.87× for LLaMA3-8B, up to 4.43× for Vicuna-7B, and up to 3.85× for Qwen2-7B while preserving original output distribution.

Conclusion: The polybasic framework provides theoretical foundation for speculative decoding, enables practical acceleration, and supports both standalone implementation and integration with existing techniques.

Abstract: Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel \emph{polybasic} speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and overall computational cost. Our framework supports both standalone implementation and integration with existing speculative techniques, leading to accelerated performance in practice. Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from $3.31\times$ to $4.01\times$ for LLaMA2-Chat 7B, up to $3.87 \times$ for LLaMA3-8B, up to $4.43 \times$ for Vicuna-7B and up to $3.85 \times$ for Qwen2-7B – all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding.
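
For context, the sketch below shows the standard draft-verify acceptance rule that speculative decoding builds on: accept a draft token with probability min(1, p_target/p_draft), otherwise resample from the normalized residual (p_target - p_draft)^+. This rule preserves the target distribution exactly; the polybasic framework generalizes this dualistic setup to a chain of draft models, which is not reproduced here.

```python
# Toy single-step draft-verify acceptance for speculative sampling.
import numpy as np

rng = np.random.default_rng(0)

def accept_or_resample(p_target: np.ndarray, p_draft: np.ndarray, token: int) -> int:
    """Standard speculative sampling step: keeps the target distribution exact."""
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token                                  # accept the draft token
    residual = np.maximum(p_target - p_draft, 0.0)    # otherwise resample
    return rng.choice(len(p_target), p=residual / residual.sum())

p_draft  = np.array([0.70, 0.20, 0.10])   # cheap model's next-token distribution
p_target = np.array([0.50, 0.30, 0.20])   # large model's distribution
draft_token = rng.choice(3, p=p_draft)
print(accept_or_resample(p_target, p_draft, draft_token))
```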

[394] Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off

Muhammad Faraz Ul Abrar, Nicolò Michelusi

Main category: cs.LG

TL;DR: This paper proposes a novel OTA-FL approach that allows structured model bias under heterogeneous wireless conditions, addressing limitations of prior zero-bias designs that are constrained by weakest devices and suffer from inflated variance in non-convex settings.

DetailsMotivation: Existing OTA-FL designs enforce zero-bias model updates by assuming homogeneous wireless conditions or forcing zero-bias, which becomes problematic under heterogeneous wireless scenarios where they are constrained by weakest devices and inflate update variance. Prior analyses also largely address convex objectives while modern AI models are non-convex.

Method: Developed novel OTA-FL SGD updates that allow structured, time-invariant model bias while facilitating reduced variance updates. Derived finite-time stationarity bound revealing bias-variance trade-off. Proposed non-convex joint OTA power-control design with efficient successive convex approximation (SCA) algorithm using only statistical CSI.

Result: Experiments on non-convex image classification task validate the approach: SCA-based design accelerates convergence via optimized bias and improves generalization over prior OTA-FL baselines.

Conclusion: The proposed biased OTA-FL framework effectively addresses wireless heterogeneity in non-convex settings by optimizing the bias-variance trade-off, leading to faster convergence and better generalization compared to traditional zero-bias approaches.

Abstract: Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.

[395] Higher-Order Regularization Learning on Hypergraphs

Adrien Weihs, Andrea Bertozzi, Matthew Thorpe

Main category: cs.LG

TL;DR: The paper extends the theoretical foundation of Higher-Order Hypergraph Learning (HOHL) by proving consistency of a truncated version and deriving explicit convergence rates for supervised learning, while demonstrating strong empirical performance in active learning and non-geometric datasets.

DetailsMotivation: To extend the theoretical understanding of HOHL beyond prior asymptotic consistency analysis and demonstrate its practical effectiveness in diverse learning scenarios beyond geometric settings.

Method: Proving consistency of truncated HOHL, deriving explicit convergence rates for supervised learning regularization, and empirical evaluation in active learning and non-geometric datasets.

Result: Established theoretical consistency and convergence rates for truncated HOHL, and showed strong empirical performance across diverse learning settings including active learning and datasets without geometric structure.

Conclusion: HOHL demonstrates both theoretical soundness and practical versatility, proving effective across various learning paradigms including supervised learning, active learning, and non-geometric data structures.

Abstract: Higher-Order Hypergraph Learning (HOHL) was recently introduced as a principled alternative to classical hypergraph regularization, enforcing higher-order smoothness via powers of multiscale Laplacians induced by the hypergraph structure. Prior work established the well- and ill-posedness of HOHL through an asymptotic consistency analysis in geometric settings. We extend this theoretical foundation by proving the consistency of a truncated version of HOHL and deriving explicit convergence rates when HOHL is used as a regularizer in fully supervised learning. We further demonstrate its strong empirical performance in active learning and in datasets lacking an underlying geometric structure, highlighting HOHL’s versatility and robustness across diverse learning settings.
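
A minimal sketch of a higher-order Laplacian penalty of the kind HOHL studies, u^T L^s u with s >= 1, using a plain graph Laplacian for brevity; HOHL itself uses powers of multiscale Laplacians induced by the hypergraph structure.

```python
import numpy as np

def laplacian(W: np.ndarray) -> np.ndarray:
    return np.diag(W.sum(axis=1)) - W

def hohl_penalty(u: np.ndarray, W: np.ndarray, s: int = 2) -> float:
    """Higher-order smoothness: u^T L^s u (s=1 is classical regularization)."""
    L = laplacian(W)
    return float(u @ np.linalg.matrix_power(L, s) @ u)

W = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])        # toy weighted graph
u = np.array([0.0, 0.5, 1.0])       # candidate label function
print(hohl_penalty(u, W, s=2))
```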

[396] A Three-Stage Bayesian Transfer Learning Framework to Improve Predictions in Data-Scarce Domains

Aidan Furlong, Robert Salko, Xingang Zhao, Xu Wu

Main category: cs.LG

TL;DR: This paper introduces staged B-DANN, a three-stage framework combining parameter transfer and domain adaptation to improve transfer learning under large domain shifts while providing uncertainty quantification.

DetailsMotivation: Deep neural networks require large datasets but experimental data are often sparse and noisy. Transfer learning helps but parameter transfer degrades under large domain shifts, and existing domain-adversarial methods lack uncertainty quantification and training stability.

Method: Three-stage framework: 1) Train deterministic feature extractor on source domain, 2) Adversarially refine using DANN, 3) Build Bayesian neural network on adapted feature extractor for target domain fine-tuning with uncertainty quantification.

Result: Validated on synthetic benchmark showing significant outperformance over standard transfer techniques. Applied to critical heat flux prediction in rectangular channels using tube experiment data as source domain, demonstrating improved predictive accuracy and generalization.

Conclusion: Staged B-DANN improves predictive accuracy and generalization while providing calibrated uncertainty estimates, potentially benefiting other nuclear engineering domains.

Abstract: The use of ML in engineering has grown steadily to support a wide array of applications. Among these methods, deep neural networks have been widely adopted due to their performance and accessibility, but they require large, high-quality datasets. Experimental data are often sparse, noisy, or insufficient to build resilient data-driven models. Transfer learning, which leverages relevant data-abundant source domains to assist learning in data-scarce target domains, has shown efficacy. Parameter transfer, where pretrained weights are reused, is common but degrades under large domain shifts. Domain-adversarial neural networks (DANNs) help address this issue by learning domain-invariant representations, thereby improving transfer under greater domain shifts in a semi-supervised setting. However, DANNs can be unstable during training and lack a native means for uncertainty quantification. This study introduces a fully-supervised three-stage framework, the staged Bayesian domain-adversarial neural network (staged B-DANN), that combines parameter transfer and shared latent space adaptation. In Stage 1, a deterministic feature extractor is trained on the source domain. This feature extractor is then adversarially refined using a DANN in Stage 2. In Stage 3, a Bayesian neural network is built on the adapted feature extractor for fine-tuning on the target domain to handle conditional shifts and yield calibrated uncertainty estimates. This staged B-DANN approach was first validated on a synthetic benchmark, where it was shown to significantly outperform standard transfer techniques. It was then applied to the task of predicting critical heat flux in rectangular channels, leveraging data from tube experiments as the source domain. The results of this study show that the staged B-DANN method can improve predictive accuracy and generalization, potentially assisting other domains in nuclear engineering.

[397] STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization

Marco Federici, Riccardo Del Chiaro, Boris van Breugel, Paul Whatmough, Markus Nagel

Main category: cs.LG

TL;DR: STaMP quantization applies linear transformations along the sequence dimension and uses mixed precision to maintain model accuracy at lower activation bit-widths by keeping a small number of tokens at higher precision.

DetailsMotivation: Quantization reduces inference latency, power, and memory footprint but often causes sharp accuracy degradation when activations are quantized below eight bits. Invertible linear transformations can help quantization by reparameterizing features.

Method: STaMP applies linear transformations along the sequence dimension to exploit local correlation in language and visual data, while keeping a small number of tokens in each intermediate activation at higher precision.

Result: STaMP significantly improves low bit width activation quantization and complements established activation and weight quantization methods, including recent feature transformations.

Conclusion: STaMP quantization is an effective strategy for maintaining model accuracy at lower average activation bit-widths by leveraging sequence transformations and mixed precision.

Abstract: Quantization is the key method for reducing inference latency, power and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization, by reparameterizing feature channels and weights. In this paper, we propose \textit{Sequence Transformation and Mixed Precision} (STaMP) quantization, a novel strategy that applies linear transformations along the \textit{sequence} dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit width activation quantization and complements established activation and weight quantization methods including recent feature transformations.
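
An illustrative sketch of the mixed-precision half of the idea: quantize all tokens to a low bit-width but keep a few tokens at higher precision. The token-selection rule here (largest norm) is an assumption, and the sequence-dimension linear transform is omitted.

```python
import numpy as np

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric fake-quantization per tensor."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def stamp_like_quant(acts: np.ndarray, n_hi: int = 2,
                     lo_bits: int = 4, hi_bits: int = 8) -> np.ndarray:
    """acts: (seq_len, hidden). Keep the n_hi largest-norm tokens at hi_bits."""
    norms = np.linalg.norm(acts, axis=1)
    hi_idx = np.argsort(norms)[-n_hi:]       # tokens kept at higher precision
    out = fake_quant(acts, lo_bits)
    out[hi_idx] = fake_quant(acts[hi_idx], hi_bits)
    return out

rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 16))
err = np.linalg.norm(acts - stamp_like_quant(acts)) / np.linalg.norm(acts)
print(f"relative quantization error: {err:.4f}")
```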

[398] Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices

Jan Stenkamp, Nina Herrmann, Benjamin Karic, Stefan Oehmcke, Fabian Gieseke

Main category: cs.LG

TL;DR: A compression scheme for boosted decision trees that achieves 4-16x compression ratio while maintaining performance, enabling lightweight ML models for IoT devices with minimal computing power and energy requirements.

DetailsMotivation: Address the growing need for lightweight machine learning models that can be deployed on compute-constrained IoT devices, allowing autonomous operation without constant communication or external energy supply.

Method: Techniques for training compact boosted decision tree ensembles that reward feature and threshold reuse during training, using an adapted training process and alternative memory layout.

Result: Models achieved the same performance as LightGBM models with 4-16x compression ratio, enabling deployment on IoT devices with minimal computing power and energy requirements.

Conclusion: The compression scheme enables a wide range of IoT applications including remote monitoring, edge analytics, and real-time decision making in isolated or power-limited environments.

Abstract: Deploying machine learning models on compute-constrained devices has become a key building block of modern IoT applications. In this work, we present a compression scheme for boosted decision trees, addressing the growing need for lightweight machine learning models. Specifically, we provide techniques for training compact boosted decision tree ensembles that exhibit a reduced memory footprint by rewarding, among other things, the reuse of features and thresholds during training. Our experimental evaluation shows that models achieved the same performance with a compression ratio of 4-16x compared to LightGBM models using an adapted training process and an alternative memory layout. Once deployed, the corresponding IoT devices can operate independently of constant communication or external energy supply, and, thus, autonomously, requiring only minimal computing power and energy. This capability opens the door to a wide range of IoT applications, including remote monitoring, edge analytics, and real-time decision making in isolated or power-limited environments.
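
A back-of-the-envelope sketch of why rewarding threshold reuse shrinks the model: shared thresholds can be stored once in a dictionary and referenced with short indices. The sizes below (float32 thresholds, one-byte indices) are illustrative assumptions, not the paper's memory layout.

```python
# Rough memory arithmetic, assuming <= 256 unique thresholds per ensemble.
def ensemble_bytes(n_nodes: int, n_unique_thresholds: int, reuse: bool) -> int:
    if reuse:   # dictionary of unique float32 thresholds + 1-byte id per node
        return 4 * n_unique_thresholds + 1 * n_nodes
    return 4 * n_nodes                 # a full float32 threshold per node

n_nodes = 10_000
ratio = ensemble_bytes(n_nodes, 200, reuse=False) / ensemble_bytes(n_nodes, 200, reuse=True)
print(f"compression from threshold reuse alone: {ratio:.1f}x")
```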

[399] Faithful and Fast Influence Function via Advanced Sampling

Jungyeon Koh, Hyeonsu Lyu, Jonggyu Jang, Hyun Jong Yang

Main category: cs.LG

TL;DR: Proposed two advanced sampling techniques (feature-based and logit-based) to improve influence function estimation by selecting representative subsets, reducing computation time by 30.1% and memory usage by 42.2% while maintaining or improving F1-score.

DetailsMotivation: Influence functions require computing Hessians which is resource-intensive for entire datasets. Random sampling leads to inconsistent estimates due to high variance, so better sampling methods are needed.

Method: Two advanced sampling techniques based on features and logits that select small but representative subsets by considering stochastic distribution of features or logits.

Result: Reduced computation time by 30.1% and memory usage by 42.2%, or improved F1-score by 2.5% compared to baseline in class removal experiments.

Conclusion: The proposed sampling methods provide more accurate influence function estimations with significant computational efficiency gains while maintaining model performance.

Abstract: How can we explain the influence of training data on black-box models? Influence functions (IFs) offer a post-hoc solution by utilizing gradients and Hessians. However, computing the Hessian for an entire dataset is resource-intensive, necessitating a feasible alternative. A common approach involves randomly sampling a small subset of the training data, but this method often results in highly inconsistent IF estimates due to the high variance in sample configurations. To address this, we propose two advanced sampling techniques based on features and logits. These samplers select a small yet representative subset of the entire dataset by considering the stochastic distribution of features or logits, thereby enhancing the accuracy of IF estimations. We validate our approach through class removal experiments, a typical application of IFs, using the F1-score to measure how effectively the model forgets the removed class while maintaining inference consistency on the remaining classes. Our method reduces computation time by 30.1% and memory usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
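
A sketch of the underlying idea: select a subset whose feature (or logit) distribution tracks the full dataset's, instead of sampling uniformly at random. The tiny k-means stratification below is an illustrative stand-in for the paper's samplers.

```python
import numpy as np

def representative_subset(feats: np.ndarray, m: int, k: int = 8, seed: int = 0):
    """Cluster features (or logits), then sample proportionally per cluster."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(10):                                  # tiny k-means loop
        assign = np.argmin(((feats[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(axis=0)
    idx = []
    for j in range(k):                                   # proportional allocation
        members = np.flatnonzero(assign == j)
        take = max(1, round(m * len(members) / len(feats)))
        idx.extend(rng.choice(members, min(take, len(members)), replace=False))
    return np.array(idx[:m])

feats = np.random.default_rng(1).normal(size=(1000, 10))
print(representative_subset(feats, m=50).shape)
```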

[400] On Measuring Localization of Shortcuts in Deep Networks

Nikita Tsoy, Nikola Konstantinov

Main category: cs.LG

TL;DR: Shortcuts (spurious rules) in deep networks are distributed throughout all layers rather than localized, with shallow layers encoding spurious features and deeper layers forgetting core features. This distributed nature makes general shortcut-mitigation methods difficult to design.

DetailsMotivation: Shortcuts present a major challenge to deep network reliability, but their impact on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods.

Method: Novel experiment design quantifying layer-wise contribution to accuracy degradation through counterfactual training on clean and skewed datasets, tested on CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures.

Result: Shortcut learning is distributed throughout the network - shallow layers predominantly encode spurious features while deeper layers predominantly forget core features. Different network parts play different roles in this process.

Conclusion: The distributed nature of shortcut learning across layers suggests the hardness of designing general mitigation methods, supporting dataset- and architecture-specific approaches instead.

Abstract: Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies the layer-wise contribution to accuracy degradation caused by a shortcut-inducing skew by counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze the differences in localization and describe its principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests the hardness of designing general methods, supporting dataset- and architecture-specific approaches instead.

[401] Wasserstein Regression as a Variational Approximation of Probabilistic Trajectories through the Bernstein Basis

Maksim Maslov, Alexander Kugaevskikh, Matthew Ivanov

Main category: cs.LG

TL;DR: A new method for regression over distributions that combines Bernstein basis parameterization with Wasserstein distance minimization, providing geometric accuracy, computational practicality, and interpretability.

DetailsMotivation: Existing approaches for regression over distributions often ignore probability space geometry or are computationally expensive, limiting their practical application.

Method: Models conditional distributions as smooth probability trajectories using Bernstein polynomials to parameterize Gaussian components, with loss function based on squared Wasserstein distance between predicted and empirical distributions.

Result: Competitive performance on synthetic datasets with complex trajectories, showing better approximation quality in nonlinear cases, smooth trajectories, robustness to data structure changes, and high interpretability.

Conclusion: The method provides a balanced solution combining geometric accuracy, computational practicality, and interpretability, with future work including extensions to non-Gaussian distributions and high-dimensional data.

Abstract: This paper considers the problem of regression over distributions, which is becoming increasingly important in machine learning. Existing approaches often ignore the geometry of the probability space or are computationally expensive. To overcome these limitations, a new method is proposed that combines the parameterization of probability trajectories using a Bernstein basis and the minimization of the Wasserstein distance between distributions. The key idea is to model a conditional distribution as a smooth probability trajectory defined by a weighted sum of Gaussian components whose parameters – the mean and covariance – are functions of the input variable constructed using Bernstein polynomials. The loss function is the averaged squared Wasserstein distance between the predicted Gaussian distributions and the empirical data, which takes into account the geometry of the distributions. An autodiff-based optimization method is used to train the model. Experiments on synthetic datasets that include complex trajectories demonstrated that the proposed method provides competitive approximation quality in terms of the Wasserstein distance, Energy Distance, and RMSE metrics, especially in cases of pronounced nonlinearity. The model demonstrates trajectory smoothness that is better than or comparable to alternatives and robustness to changes in data structure, while maintaining high interpretability due to explicit parameterization via control points. The developed approach represents a balanced solution that combines geometric accuracy, computational practicality, and interpretability. Prospects for further research include extending the method to non-Gaussian distributions, applying entropy regularization to speed up computations, and adapting the approach to working with high-dimensional data for approximating surfaces and more complex structures.
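
The two ingredients are easy to make concrete: Bernstein-basis evaluation of the mean/scale trajectories, and the closed-form squared 2-Wasserstein distance between one-dimensional Gaussians, W2^2 = (mu1 - mu2)^2 + (sigma1 - sigma2)^2. The control-point values below are illustrative.

```python
import numpy as np
from math import comb

def bernstein(t: float, control: np.ndarray) -> float:
    """Evaluate a Bernstein polynomial with given control points at t in [0, 1]."""
    n = len(control) - 1
    return sum(c * comb(n, i) * t**i * (1 - t)**(n - i)
               for i, c in enumerate(control))

def w2_sq_gauss1d(mu1, s1, mu2, s2) -> float:
    """Closed-form squared 2-Wasserstein distance between 1-D Gaussians."""
    return (mu1 - mu2) ** 2 + (s1 - s2) ** 2

mu_ctrl  = np.array([0.0, 1.5, 1.0])   # control points for the mean
sig_ctrl = np.array([0.5, 0.2, 0.4])   # control points for the std (kept > 0)

t = 0.3
mu_t, sig_t = bernstein(t, mu_ctrl), bernstein(t, sig_ctrl)
# Loss term against an empirical Gaussian fit (mu=1.0, sigma=0.3) at input t:
print(w2_sq_gauss1d(mu_t, sig_t, 1.0, 0.3))
```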

[402] Omnipresent Yet Overlooked: Heat Kernels in Combinatorial Bayesian Optimization

Colin Doumont, Victor Picheny, Viacheslav Borovitskiy, Henry Moss

Main category: cs.LG

TL;DR: The paper develops a unifying framework based on heat kernels for Bayesian Optimization in combinatorial domains, showing that many existing combinatorial kernels are related or equivalent to heat kernels, and demonstrates that heat kernels achieve state-of-the-art performance.

DetailsMotivation: Bayesian Optimization requires specialized kernels for combinatorial domains, but relationships among existing combinatorial kernels are not well understood, creating a gap in theoretical understanding.

Method: Developed a unifying framework based on heat kernels, derived systematically with simple closed-form expressions, and validated through theoretical proofs and experiments.

Result: Proved many successful combinatorial kernels are related or equivalent to heat kernels; heat kernels are not sensitive to optima location unlike other algorithms; heat kernel pipeline achieves state-of-the-art results.

Conclusion: Heat kernels provide a powerful unifying framework for combinatorial Bayesian Optimization, offering theoretical insights and practical performance advantages over existing methods.

Abstract: Bayesian Optimization (BO) has the potential to solve various combinatorial tasks, ranging from materials science to neural architecture search. However, BO requires specialized kernels to effectively model combinatorial domains. Recent efforts have introduced several combinatorial kernels, but the relationships among them are not well understood. To bridge this gap, we develop a unifying framework based on heat kernels, which we derive in a systematic way and express as simple closed-form expressions. Using this framework, we prove that many successful combinatorial kernels are either related or equivalent to heat kernels, and validate this theoretical claim in our experiments. Moreover, our analysis confirms and extends the results presented in Bounce: certain algorithms’ performance decreases substantially when the unknown optima of the function do not have a certain structure. In contrast, heat kernels are not sensitive to the location of the optima. Lastly, we show that a fast and simple pipeline, relying on heat kernels, is able to achieve state-of-the-art results, matching or even outperforming certain slow or complex algorithms.
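
On a finite domain the heat kernel is just a matrix exponential of the Laplacian, K_t = exp(-tL), which the sketch below computes by eigendecomposition on a toy graph; the paper's contribution includes closed-form expressions for structured combinatorial domains, not reproduced here.

```python
import numpy as np

def heat_kernel(W: np.ndarray, t: float) -> np.ndarray:
    """K_t = exp(-t L) for the graph Laplacian L, via eigendecomposition."""
    L = np.diag(W.sum(axis=1)) - W
    evals, evecs = np.linalg.eigh(L)
    return evecs @ np.diag(np.exp(-t * evals)) @ evecs.T

W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])            # path graph on 3 nodes
K = heat_kernel(W, t=0.5)
print(np.round(K, 3))                   # a valid PSD kernel matrix
```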

[403] MSAD: A Deep Dive into Model Selection for Time series Anomaly Detection

Emmanouil Sylligardos, John Paparrizos, Themis Palpanas, Pierre Senellart, Paul Boniol

Main category: cs.LG

TL;DR: This paper proposes using time series classification methods for model selection in anomaly detection, showing they outperform individual anomaly detection methods while maintaining similar execution times.

DetailsMotivation: No single best anomaly detection method exists for heterogeneous time series datasets, and existing AutoML solutions are not directly applicable to time series anomaly detection.

Method: Evaluated 234 model configurations from 16 base classifiers across 1980+ time series, using time series classification as model selection for anomaly detection.

Result: Model selection methods outperform every single anomaly detection method while maintaining similar execution times.

Conclusion: Time series classification algorithms provide accurate and efficient model selection for anomaly detection, establishing a strong baseline for AutoML pipelines.

Abstract: Anomaly detection is a fundamental task for time series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmarks and evaluation studies demonstrated that no overall best anomaly detection methods exist when applied to very heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will select, based on time series characteristics, the best anomaly detection methods to run. Existing AutoML solutions are, unfortunately, not directly applicable to time series anomaly detection, and no evaluation of time series-based approaches for model selection exists. Towards that direction, this paper studies the performance of time series classification methods used as model selection for anomaly detection. In total, we evaluate 234 model configurations derived from 16 base classifiers across more than 1980 time series, and we propose the first extensive experimental evaluation of time series classification as model selection for anomaly detection. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time. This evaluation is the first step to demonstrate the accuracy and efficiency of time series classification algorithms for anomaly detection, and represents a strong baseline that can then be used to guide the model selection step in general AutoML pipelines. Preprint version of an article accepted at the VLDB Journal.
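
A minimal sketch of model selection as classification: featurize each series and predict which detector to run. The summary features and random-forest selector below are illustrative stand-ins for the 16 base classifiers evaluated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize(ts: np.ndarray) -> np.ndarray:
    """Cheap summary features per series (mean, std, lag-1 autocorr, spectral peak)."""
    ac = np.corrcoef(ts[:-1], ts[1:])[0, 1]
    peak = np.abs(np.fft.rfft(ts - ts.mean())).argmax()
    return np.array([ts.mean(), ts.std(), ac, peak])

rng = np.random.default_rng(0)
X = np.stack([featurize(rng.normal(size=256)) for _ in range(100)])
y = rng.integers(0, 3, size=100)           # index of the best detector per series
selector = RandomForestClassifier().fit(X, y)
print(selector.predict(X[:1]))             # which detector to run on a new series
```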

[404] Curly Flow Matching for Learning Non-gradient Field Dynamics

Katarina Petrović, Lazar Atanackovic, Viggo Moro, Kacper Kapuśniak, İsmail İlkan Ceylan, Michael Bronstein, Avishek Joey Bose, Alexander Tong

Main category: cs.LG

TL;DR: Curly-FM is a novel flow matching approach that learns non-gradient field dynamics by solving a Schrödinger bridge problem with non-zero drift reference processes, enabling modeling of periodic behaviors in systems like cell cycles and fluid dynamics.

DetailsMotivation: Current flow matching methods assume gradient field dynamics based on least action principle, but many real-world systems exhibit non-gradient, periodic behavior (e.g., cell cycles, ocean currents) that cannot be captured by existing approaches.

Method: Designs and solves a Schrödinger bridge problem with a non-zero drift reference process using inferred velocities and population snapshot data, in contrast to typical zero-drift reference processes.

Result: Curly-FM successfully learns trajectories that better match both reference processes and population marginals in single-cell RNA, computational fluid dynamics, and ocean current applications with approximate velocities.

Conclusion: Curly-FM expands flow matching models beyond population modeling to capture known periodic behavior in physical systems, addressing a fundamental limitation of current state-of-the-art methods.

Abstract: Modeling the transport dynamics of natural processes from population-level observations is a ubiquitous problem in the natural sciences. Such models rely on key assumptions about the underlying process in order to enable faithful learning of governing dynamics that mimic the actual system behavior. The de facto assumption in current approaches relies on the principle of least action that results in gradient field dynamics and leads to trajectories minimizing an energy functional between two probability measures. However, many real-world systems, such as cell cycles in single-cell RNA, are known to exhibit non-gradient, periodic behavior, which fundamentally cannot be captured by current state-of-the-art methods such as flow and bridge matching. In this paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is capable of learning non-gradient field dynamics by designing and solving a Schrödinger bridge problem with a non-zero drift reference process – in stark contrast to typical zero-drift reference processes – which is constructed using inferred velocities in addition to population snapshot data. We showcase Curly-FM by solving the trajectory inference problems for single cells, computational fluid dynamics, and ocean currents with approximate velocities. We demonstrate that Curly-FM can learn trajectories that better match both the reference process and population marginals. Curly-FM expands flow matching models beyond the modeling of populations and towards the modeling of known periodic behavior in physical systems. Our code repository is accessible at: https://github.com/kpetrovicc/curly-flow-matching.git

[405] Tight Differentially Private PCA via Matrix Coherence

Tommaso d’Orsi, Gleb Novikov

Main category: cs.LG

TL;DR: A simple SVD-based algorithm achieves optimal private rank-r approximation with error bounds depending only on rank-r coherence and spectral gap, outperforming prior methods and matching non-private guarantees in some cases.

DetailsMotivation: To resolve an open question about computing top singular vectors under differential privacy with improved error bounds that depend on structural properties like coherence rather than worst-case assumptions.

Method: Uses singular value decomposition with standard perturbation mechanisms (Gaussian mechanism) to compute private rank-r approximations, leveraging the observation that coherence doesn’t increase under Gaussian perturbations.

Result: The algorithm achieves the same guarantees as optimal non-private algorithms for single-spike PCA in Wishart models, where prior private methods failed. Also shows coherence preservation under Gaussian perturbations.

Conclusion: The proposed simple algorithm provides optimal private approximations by exploiting coherence properties, with potential applications to graph problems and planted models where similar structural assumptions hold.

Abstract: We revisit the task of computing the span of the top $r$ singular vectors $u_1, \ldots, u_r$ of a matrix under differential privacy. We show that a simple and efficient algorithm – based on singular value decomposition and standard perturbation mechanisms – returns a private rank-$r$ approximation whose error depends only on the \emph{rank-$r$ coherence} of $u_1, \ldots, u_r$ and the spectral gap $\sigma_r - \sigma_{r+1}$. This resolves a question posed by Hardt and Roth~\cite{hardt2013beyond}. Our estimator outperforms the state of the art – significantly so in some regimes. In particular, we show that in the dense setting, it achieves the same guarantees for single-spike PCA in the Wishart model as those attained by optimal non-private algorithms, whereas prior private algorithms failed to do so. In addition, we prove that (rank-$r$) coherence does not increase under Gaussian perturbations. This implies that any estimator based on the Gaussian mechanism – including ours – preserves the coherence of the input. We conjecture that similar behavior holds for other structured models, including planted problems in graphs. We also explore applications of coherence to graph problems. In particular, we present a differentially private algorithm for Max-Cut and other constraint satisfaction problems under low coherence assumptions.
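
A sketch of the perturb-then-SVD recipe under assumed parameters: add symmetric noise, then take the top-r singular subspace. The noise scale below is a placeholder; a real deployment must calibrate it to the matrix's sensitivity and the chosen privacy parameters.

```python
import numpy as np

def private_rank_r(A: np.ndarray, r: int, noise_scale: float, seed: int = 0):
    """Perturb with symmetric noise, then return the top-r singular subspace."""
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=noise_scale, size=A.shape)
    E = (E + E.T) / 2                   # symmetric perturbation
    U, s, Vt = np.linalg.svd(A + E)
    return U[:, :r], s[:r]

A = np.diag([5.0, 4.0, 0.5, 0.1])       # large spectral gap after rank 2
U_r, s_r = private_rank_r(A, r=2, noise_scale=0.1)
print(s_r)
```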

[406] LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits

Amir Reza Mirzaei, Yuqiao Wen, Yanshuai Cao, Lili Mou

Main category: cs.LG

TL;DR: LoRAQuant is a mixed-precision quantization method for LoRA adapters that uses SVD reparameterization to enable ultra-low bitwidth quantization while maintaining performance comparable to higher-precision methods.

DetailsMotivation: Multiple LoRA adapters are often loaded simultaneously for LLM customization, but their aggregate computational cost becomes substantial at scale despite each being lightweight individually.

Method: Reparameterizes each LoRA adapter using SVD to concentrate important information into specific rows and columns, allowing important components to be quantized to higher precision while the rest can be quantized to ultra-low bitwidth.

Result: Experiments on LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models show LoRAQuant uses significantly lower bits than other quantization methods while achieving comparable or even higher performance on mathematical reasoning, coding, and summarization tasks.

Conclusion: LoRAQuant effectively reduces the computational cost of multiple LoRA adapters through intelligent mixed-precision quantization without sacrificing performance.

Abstract: Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that our LoRAQuant uses significantly lower bits than other quantization methods, but achieves comparable or even higher performance.
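
An illustrative sketch of the reparameterize-then-quantize idea: take the SVD of the LoRA update BA so its energy concentrates in the leading components, keep those at higher precision, and push the rest to ultra-low bit-width. The bit-widths and split point `k_hi` are assumptions for illustration.

```python
import numpy as np

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def loraquant_like(B: np.ndarray, A: np.ndarray, k_hi: int = 2,
                   hi_bits: int = 8, lo_bits: int = 2) -> np.ndarray:
    r = A.shape[0]                                  # LoRA rank
    U, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
    P, Q = U[:, :r] * s[:r], Vt[:r]                 # SVD concentrates the energy
    P_q = np.concatenate([fake_quant(P[:, :k_hi], hi_bits),
                          fake_quant(P[:, k_hi:], lo_bits)], axis=1)
    Q_q = np.concatenate([fake_quant(Q[:k_hi], hi_bits),
                          fake_quant(Q[k_hi:], lo_bits)], axis=0)
    return P_q @ Q_q                                # mixed-precision update

rng = np.random.default_rng(0)
B, A = rng.normal(size=(64, 8)), rng.normal(size=(8, 64))
delta = B @ A
err = np.linalg.norm(delta - loraquant_like(B, A)) / np.linalg.norm(delta)
print(f"relative reconstruction error: {err:.4f}")
```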

[407] How Regularization Terms Make Invertible Neural Networks Bayesian Point Estimators

Nick Heilenkötter

Main category: cs.LG

TL;DR: The paper shows that specific regularization terms in invertible neural network training can recover Bayesian point estimators like posterior mean and MAP estimator upon network inversion.

DetailsMotivation: Invertible neural networks are attractive for inverse problems due to stability and interpretability, but existing optimization strategies have limitations from a Bayesian perspective.

Method: Introduce and analyze two regularization terms for network training that, when the network is inverted, recover properties of classical Bayesian point estimators - one connected to posterior mean and another resembling MAP estimator.

Result: Theoretical analysis characterizes how each loss shapes both the learned forward operator and its inverse reconstruction map. Numerical experiments demonstrate stable and interpretable data-dependence introduced by these loss-term regularizers.

Conclusion: Regularization terms in invertible neural network training can effectively recover Bayesian point estimators, providing stable and interpretable solutions for inverse problems.

Abstract: Can regularization terms in the training of invertible neural networks lead to known Bayesian point estimators in reconstruction? Invertible networks are attractive for inverse problems due to their inherent stability and interpretability. Recently, optimization strategies for invertible neural networks that approximate either a reconstruction map or the forward operator have been studied from a Bayesian perspective, but each has limitations. To address this, we introduce and analyze two regularization terms for the network training that, upon inversion of the network, recover properties of classical Bayesian point estimators: while the first can be connected to the posterior mean, the second resembles the MAP estimator. Our theoretical analysis characterizes how each loss shapes both the learned forward operator and its inverse reconstruction map. Numerical experiments support our findings and demonstrate how these loss-term regularizers introduce data-dependence in a stable and interpretable way.

[408] Budgeted Multiple-Expert Deferral

Giulia DeSalvo, Clara Mohri, Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: Proposes budgeted deferral framework to train deferral algorithms while minimizing expert query costs during training, with new algorithms for selective expert querying.

DetailsMotivation: Standard deferral training requires querying all experts for every instance, which is prohibitively expensive when expert queries are costly, undermining the goal of limiting unnecessary expert usage.

Method: Introduces budgeted deferral framework with new algorithms for two-stage and single-stage multiple-expert settings that selectively query only a subset of experts per training example, balancing cost and predictive performance.

Result: Theoretical guarantees including generalization bounds and label complexity analyses. Empirical results show substantial training cost reduction without sacrificing prediction accuracy across several domains.

Conclusion: Budget-aware deferral algorithms provide practical value by enabling effective deferral training while minimizing expert query costs, achieving the core goal of limiting unnecessary expert usage.

Abstract: Learning to defer uncertain predictions to costly experts offers a powerful strategy for improving the accuracy and efficiency of machine learning systems. However, standard training procedures for deferral algorithms typically require querying all experts for every training instance, an approach that becomes prohibitively expensive when expert queries incur significant computational or resource costs. This undermines the core goal of deferral: to limit unnecessary expert usage. To overcome this challenge, we introduce the budgeted deferral framework, which aims to train effective deferral algorithms while minimizing expert query costs during training. We propose new algorithms for both two-stage and single-stage multiple-expert deferral settings that selectively query only a subset of experts per training example. While inspired by active learning, our setting is fundamentally different: labels are already known, and the core challenge is to decide which experts to query in order to balance cost and predictive performance. We establish theoretical guarantees for both of our algorithms, including generalization bounds and label complexity analyses. Empirical results across several domains show that our algorithms substantially reduce training costs without sacrificing prediction accuracy, demonstrating the practical value of our budget-aware deferral algorithms.
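
A toy sketch of budget-aware querying: per training example, query experts only while the model is uncertain and a cost budget remains. The confidence threshold and cheapest-first rule are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

def select_experts_to_query(model_conf: float, expert_costs: np.ndarray,
                            budget: float) -> np.ndarray:
    """Query cheap experts first, and only while the model is unsure."""
    if model_conf >= 0.9:                      # confident: no queries needed
        return np.array([], dtype=int)
    order = np.argsort(expert_costs)           # cheapest experts first
    spent, chosen = 0.0, []
    for j in order:
        if spent + expert_costs[j] > budget:
            break
        chosen.append(j)
        spent += expert_costs[j]
    return np.array(chosen)

print(select_experts_to_query(0.55, np.array([1.0, 3.0, 0.5]), budget=2.0))
```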

[409] An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning

Chuyan Chen, Chenyang Ma, Zhangxin Li, Yutong He, Yanjie Dong, Kun Yuan

Main category: cs.LG

TL;DR: ARC-Top-K is a new gradient compression method that enables efficient All-Reduce operations while preserving globally important gradient information, achieving up to 60.7% faster training while maintaining Top-K accuracy.

DetailsMotivation: Existing gradient compressors have limitations: Rand-K discards structural information and performs poorly, while Top-K preserves informative entries but loses contraction property and requires costly All-Gather operations.

Method: ARC-Top-K aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. It’s provably contractive and combined with momentum error feedback (EF21M).

Result: ARC-Top-K matches the accuracy of Top-K while reducing wall-clock training time by up to 60.7%. It achieves linear speedup and sharper convergence rates than original EF21M under standard assumptions.

Conclusion: ARC-Top-K offers an efficient and scalable solution that combines the robustness of Rand-K with the strong performance of Top-K, addressing communication bottlenecks in large-scale distributed machine learning.

Abstract: Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an All-Reduce-compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
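
A toy simulation of the aligned-sparsity idea: nodes agree on a single Top-K index set derived from an aggregated sketch of their gradients, so the selected coordinates can be summed with a plain (index-free) All-Reduce. Using the exact mean of squared entries as the "sketch" is a simplifying assumption; the paper uses a lightweight sketch that would itself be cheaply aggregated.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_nodes, K = 1000, 4, 50
grads = rng.normal(size=(n_nodes, d))               # one gradient per node

# Step 1: consensus sketch -> every node derives the same Top-K index set.
energy = (grads ** 2).mean(axis=0)
shared_idx = np.argsort(energy)[-K:]                # identical on every node

# Step 2: all nodes transmit the same coordinates -> summable without indices.
compressed = np.zeros_like(grads)
compressed[:, shared_idx] = grads[:, shared_idx]
allreduced = compressed.mean(axis=0)                # index-free All-Reduce

dense_mean = grads.mean(axis=0)
print(np.linalg.norm(allreduced - dense_mean) / np.linalg.norm(dense_mean))
```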

[410] LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation

Gabriel Asher, Devesh Shah, Amy A. Caudy, Luke Ferro, Lea Amar, Ana S. H. Costa, Thomas Patton, Niall O’Connor, Jennifer M. Campbell, Jack Geremia

Main category: cs.LG

TL;DR: LSM-MS2 is a deep learning foundation model that achieves state-of-the-art spectral identification performance, improving accuracy by 30% for challenging isomeric compounds and yielding 42% more correct identifications in complex biological samples.

DetailsMotivation: Most mass spectrometry data remains uncharacterized, leaving biological and chemical information untapped. Machine learning can help address this gap in spectral identification tasks.

Method: LSM-MS2 is a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space.

Result: The model improves identification accuracy by 30% for challenging isomeric compounds, yields 42% more correct identifications in complex biological samples, maintains robustness under low-concentration conditions, and produces rich spectral embeddings for biological interpretation.

Conclusion: LSM-MS2 enables direct biological interpretation from minimal downstream data and successfully differentiates disease states while predicting clinical outcomes across diverse translational applications.

Abstract: A vast majority of mass spectrometry data remains uncharacterized, leaving much of its biological and chemical information untapped. Recent advances in machine learning have begun to address this gap, particularly for tasks such as spectral identification in tandem mass spectrometry data. Here, we present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds, yielding 42% more correct identifications in complex biological samples, and maintaining robustness under low-concentration conditions. Furthermore, LSM-MS2 produces rich spectral embeddings that enable direct biological interpretation from minimal downstream data, successfully differentiating disease states and predicting clinical outcomes across diverse translational applications.

[411] On Purely Private Covariance Estimation

Tommaso d’Orsi, Gleb Novikov

Main category: cs.LG

TL;DR: A simple perturbation mechanism for differentially private covariance matrix estimation that achieves optimal error bounds across various norms, particularly excelling in spectral norm for large datasets and improving Frobenius norm error for small datasets.

DetailsMotivation: To develop a differentially private covariance estimator that achieves optimal error guarantees across different matrix norms, addressing limitations of previous methods that didn't achieve optimal spectral norm error or had suboptimal performance for small datasets.

Method: A perturbation mechanism for covariance matrices under pure differential privacy, with an additional projection step onto the nuclear norm ball for small datasets to improve error bounds.

Result: For large datasets (n ≥ d²/ε), achieves optimal Frobenius norm error and best-known error for all p-Schatten norms, with information-theoretically optimal error for p ≥ 2. For small datasets (n < d²/ε), achieves optimal Frobenius norm error O(√(d·Tr(Σ)/n)), improving over previous bounds.

Conclusion: The proposed mechanism provides a unified approach that achieves optimal or near-optimal error guarantees across different matrix norms and dataset sizes, making it the first purely private covariance estimator with optimal spectral norm error.

Abstract: We present a simple perturbation mechanism for the release of $d$-dimensional covariance matrices $\Sigma$ under pure differential privacy. For large datasets with at least $n\geq d^2/\varepsilon$ elements, our mechanism recovers the provably optimal Frobenius norm error guarantees of \cite{nikolov2023private}, while simultaneously achieving best known error for all other $p$-Schatten norms, with $p\in [1,\infty]$. Our error is information-theoretically optimal for all $p\ge 2$, in particular, our mechanism is the first purely private covariance estimator that achieves optimal error in spectral norm. For small datasets $n< d^2/\varepsilon$, we further show that by projecting the output onto the nuclear norm ball of appropriate radius, our algorithm achieves the optimal Frobenius norm error $O(\sqrt{d\,\text{Tr}(\Sigma)/n})$, improving over the known bounds of $O(\sqrt{d/n})$ of \cite{nikolov2023private} and ${O}\big(d^{3/4}\sqrt{\text{Tr}(\Sigma)/n}\big)$ of \cite{dong2022differentially}.

[412] Pre-trained Forecasting Models: Strong Zero-Shot Feature Extractors for Time Series Classification

Andreas Auer, Daniel Klotz, Sebastian Böck, Sepp Hochreiter

Main category: cs.LG

TL;DR: Frozen pre-trained forecasting models can provide effective representations for time series classification, achieving accuracy comparable to or better than models pre-trained specifically for classification.

DetailsMotivation: To examine whether learned representations from time series forecasting models are generalizable to classification tasks, challenging the assumption that task-specific pre-training is necessary.

Method: Compare different representation extraction strategies and introduce two model-agnostic embedding augmentations for frozen pre-trained forecasting models.

Result: Best forecasting models achieve classification accuracy matching or surpassing state-of-the-art classification-specific models, with positive correlation between forecasting and classification performance.

Conclusion: Learning to forecast may provide a powerful route toward constructing general-purpose time series foundation models, reducing the need for task-specific pre-training.

Abstract: Recent research on time series foundation models has primarily focused on forecasting, leaving it unclear how generalizable their learned representations are. In this study, we examine whether frozen pre-trained forecasting models can provide effective representations for classification. To this end, we compare different representation extraction strategies and introduce two model-agnostic embedding augmentations. Our experiments show that the best forecasting models achieve classification accuracy that matches or even surpasses that of state-of-the-art models pre-trained specifically for classification. Moreover, we observe a positive correlation between forecasting and classification performance. These findings challenge the assumption that task-specific pre-training is necessary, and suggest that learning to forecast may provide a powerful route toward constructing general-purpose time series foundation models.
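As a sketch of the linear-probe setup the paper evaluates: pool per-timestep embeddings from a frozen pre-trained forecaster and fit a standard classifier on top. The `encoder` below is a hypothetical callable, and concatenating two pooling strategies is only a guess at what a model-agnostic embedding augmentation might look like.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(encoder, series_batch):
    """Pool per-timestep embeddings from a frozen forecasting encoder.

    `encoder` is a hypothetical callable returning a (batch, time, dim) array.
    The paper compares several extraction strategies; two are combined here.
    """
    emb = encoder(series_batch)                      # (B, T, D), frozen: no gradients
    return np.concatenate([emb.mean(axis=1),         # mean pooling
                           emb[:, -1, :]], axis=1)   # last hidden state

# Linear probe on frozen features (X_train: raw series, y_train: labels):
# feats = extract_features(pretrained_forecaster.encode, X_train)
# clf = LogisticRegression(max_iter=1000).fit(feats, y_train)
# acc = clf.score(extract_features(pretrained_forecaster.encode, X_test), y_test)
```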

[413] Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability

Tao Tao, Maissam Barkeshli

Main category: cs.LG

TL;DR: Transformers can learn to predict sequences from Permuted Congruential Generators (PCGs), including challenging variants with bitwise operations, scaling to moduli up to 2^22 and handling single-bit outputs.

DetailsMotivation: To study Transformer models' capability to learn complex pseudo-random number generation sequences, particularly PCGs which are more difficult than linear congruential generators due to bitwise operations.

Method: Train Transformers on PCG sequences with various moduli (up to 2^22), using up to 50M parameters and datasets with up to 5B tokens. Analyze prediction performance, scaling laws, and embedding representations.

Result: Transformers successfully predict PCG sequences beyond classical attacks, even with single-bit outputs. Models can jointly learn multiple PRNGs and follow a sqrt(m) scaling law for required sequence length. Curriculum learning is essential for larger moduli.

Conclusion: Transformers demonstrate remarkable ability to learn complex PRNG patterns, revealing novel clustering phenomena in embeddings and establishing scaling relationships that inform training strategies for learning arithmetic structures.

Abstract: We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the model spontaneously groups the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
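For concreteness, below is the classic PCG-XSH-RR 64/32 output function, whose shift/XOR/rotate/truncate pipeline is what separates PCGs from plain LCGs. The paper trains on PCG variants with much smaller moduli (up to 2^22) and truncated outputs, so treat this as a reference generator, not the paper's exact data pipeline.

```python
MULT = 6364136223846793005
INC = 1442695040888963407
MASK64 = (1 << 64) - 1

def rotr32(x, r):
    """32-bit right rotation."""
    return ((x >> r) | (x << ((-r) & 31))) & 0xFFFFFFFF

def pcg_xsh_rr(state, n):
    """Yield n outputs of the classic PCG-XSH-RR 64/32 generator."""
    out = []
    for _ in range(n):
        state = (state * MULT + INC) & MASK64                     # underlying LCG step
        xorshifted = (((state >> 18) ^ state) >> 27) & 0xFFFFFFFF  # xorshift + truncate
        rot = state >> 59                                          # data-dependent rotation
        out.append(rotr32(xorshifted, rot))
    return out

print(pcg_xsh_rr(42, 5))  # deterministic token sequence for a training corpus
```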

[414] Reward Collapse in Aligning Large Language Models

Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su

Main category: cs.LG

TL;DR: The paper identifies ‘reward collapse’ - a phenomenon where ranking-based reward models trained on human preferences result in identical reward distributions regardless of prompts, and proposes a prompt-aware optimization scheme to address this issue.

DetailsMotivation: Current ranking-based approaches for aligning LLMs with human preferences lead to reward collapse, where prompts with different characteristics (open-ended vs. specific) receive identical reward distributions, which is undesirable for proper model training.

Method: The authors introduce a prompt-aware optimization scheme that incorporates prompt-related information during training, derived from theoretical analysis of the insufficiency of ranking-based objectives.

Result: Experimental results show that the proposed prompt-aware utility functions significantly alleviate reward collapse during reward model training.

Conclusion: Prompt-aware optimization effectively addresses reward collapse by enabling prompt-dependent reward distributions, improving the alignment of LLMs with human preferences.

Abstract: The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of reward collapse, an empirical observation where the prevailing ranking-based approach results in an identical reward distribution regardless of the prompts during the terminal phase of training. This outcome is undesirable as open-ended prompts like "write a short story about your best friend" should yield a continuous range of rewards for their completions, while specific prompts like "what is the capital of New Zealand" should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime. To overcome reward collapse, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.

[415] Chaos-based reinforcement learning with TD3

Toshitaka Matsuki, Yusuke Sakemi, Kazuyuki Aihara

Main category: cs.LG

TL;DR: This paper introduces Twin Delayed Deep Deterministic Policy Gradients (TD3) to Chaos-based Reinforcement Learning (CBRL), showing that TD3 works effectively in CBRL for continuous action spaces and enables autonomous switching between exploration and exploitation.

DetailsMotivation: Previous CBRL methods lacked thorough learning algorithm development and didn't incorporate recent advances in reinforcement learning, particularly for deterministic and continuous action spaces.

Method: The study integrated TD3, a state-of-the-art deep reinforcement learning algorithm, into CBRL framework and tested it on a simple goal-reaching task while examining the effect of agent’s chaoticity on learning.

Result: TD3 successfully works as a learning algorithm for CBRL, enabling agents to autonomously suppress exploration as learning progresses and resume it when environment changes. Results show there’s an optimal range of chaos strength for flexible switching between exploration and exploitation.

Conclusion: TD3 is an effective learning algorithm for CBRL that enables adaptive behavior, with optimal performance achieved when agent’s chaoticity is within a suitable range for balancing exploration and exploitation.

Abstract: Chaos-based reinforcement learning (CBRL) is a method in which the agent’s internal chaotic dynamics drives exploration. However, the learning algorithms in CBRL have not been thoroughly developed in previous studies, nor have they incorporated recent advances in reinforcement learning. This study introduced Twin Delayed Deep Deterministic Policy Gradients (TD3), which is one of the state-of-the-art deep reinforcement learning algorithms that can treat deterministic and continuous action spaces, to CBRL. The validation results provide several insights. First, TD3 works as a learning algorithm for CBRL in a simple goal-reaching task. Second, CBRL agents with TD3 can autonomously suppress their exploratory behavior as learning progresses and resume exploration when the environment changes. Finally, examining the effect of the agent’s chaoticity on learning shows that there exists a suitable range of chaos strength in the agent’s model to flexibly switch between exploration and exploitation and adapt to environmental changes.

[416] A mathematical certification for positivity conditions in Neural Networks with applications to partial monotonicity and Trustworthy AI

Alejandro Polo-Molina, David Alfaya, Jose Portela

Main category: cs.LG

TL;DR: LipVor algorithm certifies if black-box models like ANNs are partially monotonic by leveraging Lipschitz continuity and Voronoi diagrams to verify positivity conditions without architectural constraints.

DetailsMotivation: ANNs are often excluded from critical applications like credit scoring due to their black-box nature and inability to guarantee partial monotonicity constraints, which are essential for trustworthiness.

Method: LipVor uses Lipschitz continuity to construct positive neighborhoods around evaluated points and employs Voronoi diagrams to provide sufficient conditions for certifying positivity across the entire domain.

Result: The algorithm successfully certifies partial monotonicity in ANNs without requiring constrained architectures or piece-wise linear activations, enabling their use in critical applications.

Conclusion: LipVor enables certification of partial monotonicity and other properties in unconstrained ANNs, potentially expanding their applicability to critical fields previously avoided due to trustworthiness concerns.

Abstract: Artificial Neural Networks (ANNs) have become a powerful tool for modeling complex relationships in large-scale datasets. However, their black-box nature poses trustworthiness challenges. In certain situations, ensuring trust in predictions might require following specific partial monotonicity constraints. However, certifying if an already-trained ANN is partially monotonic is challenging. Therefore, ANNs are often disregarded in some critical applications, such as credit scoring, where partial monotonicity is required. To address this challenge, this paper presents a novel algorithm (LipVor) that certifies if a black-box model, such as an ANN, is positive based on a finite number of evaluations. Consequently, since partial monotonicity can be expressed as a positivity condition on partial derivatives, LipVor can certify whether an ANN is partially monotonic. To do so, for every positively evaluated point, the Lipschitzianity of the black-box model is used to construct a specific neighborhood where the function remains positive. Next, based on the Voronoi diagram of the evaluated points, a sufficient condition is stated to certify if the function is positive in the domain. Unlike prior methods, our approach certifies partial monotonicity without constrained architectures or piece-wise linear activations. Therefore, LipVor could open up the possibility of using unconstrained ANN in some critical fields. Moreover, some other properties of an ANN, such as convexity, can be posed as positivity conditions, and therefore, LipVor could also be applied.
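The core certificate is simple: if f is L-Lipschitz and f(x) > 0, then f stays positive on the ball of radius f(x)/L around x. Below is a hedged sketch of the resulting coverage check, testing Voronoi vertices and box corners as candidate worst-case points; the paper's sufficient condition is stated more carefully than this simplification.

```python
import numpy as np
from scipy.spatial import Voronoi

def certify_positivity(points, values, lipschitz, domain_box):
    """LipVor-style coverage check, simplified.

    Each point x with f(x) > 0 certifies positivity on a ball of radius
    f(x)/L. Coverage is tested only at Voronoi vertices and box corners.
    Assumes enough points for a Voronoi diagram (qhull needs > d+1 points).
    """
    points, values = np.asarray(points), np.asarray(values)
    if np.any(values <= 0):
        return False                                  # a non-positive sample refutes outright
    radii = values / lipschitz                        # certified positivity radii
    corners = np.array(np.meshgrid(*domain_box)).reshape(len(domain_box), -1).T
    candidates = np.vstack([Voronoi(points).vertices, corners])
    lo, hi = np.array(domain_box).T
    candidates = candidates[np.all((candidates >= lo) & (candidates <= hi), axis=1)]
    # every candidate worst-case point must lie inside some certified ball
    dists = np.linalg.norm(candidates[:, None, :] - points[None, :, :], axis=2)
    return bool(np.all((dists <= radii[None, :]).any(axis=1)))

# Example: certify f(x, y) = 2 + sin(x) + cos(y) (Lipschitz constant <= 2) on [0, 1]^2
rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(50, 2))
vals = 2 + np.sin(pts[:, 0]) + np.cos(pts[:, 1])
print(certify_positivity(pts, vals, lipschitz=2.0, domain_box=[(0, 1), (0, 1)]))
```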

[417] Constrained Posterior Sampling: Time Series Generation with Hard Constraints

Sai Shankar Narasimhan, Shubhankar Agarwal, Litu Rout, Sanjay Shakkottai, Sandeep P. Chinchali

Main category: cs.LG

TL;DR: Constrained Posterior Sampling (CPS) is a diffusion-based algorithm that generates realistic time series while satisfying domain-specific hard constraints, outperforming existing methods in quality and scalability.

DetailsMotivation: Need for generating realistic time series samples that satisfy hard constraints for stress-testing models and protecting privacy, especially in engineering and safety-critical applications like power grid testing.

Method: CPS projects the posterior mean estimate into the constraint set after each denoising update in diffusion models, enabling handling of multiple constraints without additional training.

Result: CPS outperforms state-of-the-art methods by ~70% in sample quality and ~22% in similarity to real time series on stocks, traffic, and air quality datasets, while scaling to ~100 constraints.

Conclusion: CPS provides an effective and scalable solution for constrained time series generation with theoretical justification and superior empirical performance across multiple real-world domains.

Abstract: Generating realistic time series samples is crucial for stress-testing models and protecting user privacy by using synthetic data. In engineering and safety-critical applications, these samples must meet certain hard constraints that are domain-specific or naturally imposed by physics or nature. Consider, for example, generating electricity demand patterns with constraints on peak demand times. This can be used to stress-test the functioning of power grids during adverse weather conditions. Existing approaches for generating constrained time series are either not scalable or degrade sample quality. To address these challenges, we introduce Constrained Posterior Sampling (CPS), a diffusion-based sampling algorithm that aims to project the posterior mean estimate into the constraint set after each denoising update. Notably, CPS scales to a large number of constraints ($\sim100$) without requiring additional training. We provide theoretical justifications highlighting the impact of our projection step on sampling. Empirically, CPS outperforms state-of-the-art methods in sample quality and similarity to real time series by around 70% and 22%, respectively, on real-world stocks, traffic, and air quality datasets.
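A schematic of the projection idea: after each denoising update, the posterior mean estimate (the predicted clean sample) is projected onto the constraint set before re-noising. The sampler below is a generic DDPM/DDIM-style loop, not the paper's exact algorithm, and `project` could be as simple as clipping for box constraints.

```python
import torch

def constrained_posterior_sampling(denoiser, project, alphas_cumprod, shape):
    """Schematic diffusion sampler with a CPS-style projection step.

    `denoiser(x_t, t)` predicts the noise; `project` maps the posterior mean
    estimate onto the constraint set. The re-noising update is a generic
    DDIM-like step used for illustration.
    """
    x = torch.randn(shape)
    for t in reversed(range(len(alphas_cumprod))):
        a_bar = alphas_cumprod[t]
        x0_hat = (x - (1 - a_bar).sqrt() * denoiser(x, t)) / a_bar.sqrt()
        x0_hat = project(x0_hat)                      # project onto the constraint set
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * noise
    return x

# Example constraint: a time series bounded in [-1, 1] (box constraint).
# sample = constrained_posterior_sampling(model, lambda z: z.clamp(-1, 1),
#                                         alphas_cumprod, shape=(1, 64))
```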

[418] In Defence of Post-hoc Explainability

Nick Oh

Main category: cs.LG

TL;DR: Post-hoc explainability methods are defended as legitimate scientific tools in ML, arguing they can produce knowledge through mediated understanding without requiring complete mechanistic transparency.

DetailsMotivation: To address criticism about the reliability and epistemic status of post-hoc explainability methods in machine learning, and to establish their legitimacy for scientific knowledge production.

Method: Develops a philosophical framework based on mediated understanding and bounded factivity, analyzing recent biomedical ML applications to demonstrate proper integration into scientific practice.

Result: Shows that post-hoc methods, when properly validated and acknowledging their approximative nature, can generate novel hypotheses and advance phenomenal understanding in scientific contexts.

Conclusion: Post-hoc explainability methods are valid scientific tools that can contribute to knowledge production when integrated with rigorous empirical validation and recognition of their bounded nature.

Abstract: This position paper defends post-hoc explainability methods as legitimate tools for scientific knowledge production in machine learning. Addressing criticism of these methods’ reliability and epistemic status, we develop a philosophical framework grounded in mediated understanding and bounded factivity. We argue that scientific insights can emerge through structured interpretation of model behaviour without requiring complete mechanistic transparency, provided explanations acknowledge their approximative nature and undergo rigorous empirical validation. Through analysis of recent biomedical ML applications, we demonstrate how post-hoc methods, when properly integrated into scientific practice, generate novel hypotheses and advance phenomenal understanding.

[419] Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data

Ishika Agarwal, Dilek Hakkani-Tür

Main category: cs.LG

TL;DR: NN-CIFT uses small neural networks (InfluenceNetwork) to estimate influence values, achieving 99% cost reduction compared to traditional methods while maintaining performance.

DetailsMotivation: Existing influence function methods suffer from high computational costs, large memory requirements, and poor generalization, especially with large language models and datasets.

Method: Proposes NN-CIFT that trains small neural networks (0.0027% the size of full models) to estimate influence values, applied to subset selection for instruction fine-tuning.

Result: Achieves up to 99% cost reduction while showing no performance compromise compared to four state-of-the-art influence functions, with models just 0.0027% the size of 7B/8B language models.

Conclusion: Small neural networks can effectively estimate influence values with massive computational savings, enabling scalable influence analysis for large language models.

Abstract: Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks – which we refer to as the InfluenceNetwork – to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance between NN-CIFT and the original influence functions, despite large speedups. We provide an in-depth hyperparameter analysis of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.
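A minimal sketch of the idea, assuming influence is predicted from frozen (train, test) example embeddings: a tiny MLP is fit on influence values computed by an expensive reference method for a seed set of pairs, then reused for everything else. All sizes and names here are illustrative.

```python
import torch
import torch.nn as nn

class InfluenceNetwork(nn.Module):
    """Tiny MLP regressing the influence of a training point on a test point.

    Sizes are illustrative; the paper's networks are ~0.0027% the size of a
    7B/8B LLM. Inputs are assumed to be concatenated example embeddings.
    """
    def __init__(self, emb_dim=384, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, train_emb, test_emb):
        return self.net(torch.cat([train_emb, test_emb], dim=-1)).squeeze(-1)

# Training sketch: fit on a small set of influence values computed by an
# expensive reference method, then predict influence for the remaining pairs.
model = InfluenceNetwork()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
# for train_emb, test_emb, influence in seed_pairs:   # seed_pairs: hypothetical loader
#     opt.zero_grad()
#     loss = loss_fn(model(train_emb, test_emb), influence)
#     loss.backward(); opt.step()
```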

[420] Language Models can Self-Improve at State-Value Estimation for Better Search

Ethan Mendes, Alan Ritter

Main category: cs.LG

TL;DR: STL is a reward-free framework that improves language model value functions through explicit state transition reasoning, enabling more accurate value predictions and efficient search without labeled data.

DetailsMotivation: Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is prohibitively expensive, especially in interactive domains like web tasks.

Method: STL trains a value LLM to simulate lookahead in natural language - predicting next action, resulting state, and value rationale, refining value estimates without labeled data through self-supervised learning.

Result: STL-trained value models boost web agent success rates by 39%, achieve comparable performance with proprietary models, generalize to multi-hop QA and math puzzles, and reduce inference costs by enabling efficient search.

Conclusion: STL enables small open-source models to guide efficient search by integrating explicit reasoning with value learning, reducing the need for expensive labeled data while maintaining strong performance.

Abstract: Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive, particularly in interactive domains such as web tasks. We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model-based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain-of-thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language - predicting the next action, resulting state, and rationale for its value, thereby refining value estimates without any labeled data. This self-supervised procedure yields more accurate state-value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL-trained value models built on moderately sized (8B parameter) open-weight LLMs boost web agent success rates by 39%, achieving comparable performance with proprietary models. STL also generalizes to multi-hop QA and math puzzles. We find that STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.

[421] Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models

Junyi Zhu, Ruicong Yao, Taha Ceritli, Savas Ozkan, Matthew B. Blaschko, Eunchung Noh, Jeongwon Min, Cho Jung Min, Mete Ozay

Main category: cs.LG

TL;DR: A framework that combines federated learning and model merging to train models in hybrid data regimes where both centralized and decentralized data coexist, achieving faster convergence and better performance than existing methods.

DetailsMotivation: Current training paradigms focus on either centralized or decentralized data, but real-world data availability is often hybrid. This setting presents opportunities as both regimes offer complementary trade-offs: decentralized data is abundant but heterogeneous, while centralized data enables better curation despite being limited.

Method: Proposes a framework that constructs a model atlas from decentralized models and leverages centralized data to refine a global model within this structured space. The refined model is then used to reinitialize the decentralized models, synergizing federated learning and model merging.

Result: Theoretically achieves faster convergence than decentralized-only methods due to variance reduction. Extensive experiments show consistent outperformance over purely centralized, purely decentralized, and existing hybrid-adaptable methods. Remains robust even when data domains differ or decentralized data contains noise.

Conclusion: The proposed framework effectively addresses hybrid data regimes by combining the strengths of both centralized and decentralized training paradigms, significantly broadening applicability in real-world scenarios with mixed data availability.

Abstract: Current network training paradigms primarily focus on either centralized or decentralized data regimes. However, in practice, data availability often exhibits a hybrid nature, where both regimes coexist. This hybrid setting presents new opportunities for model training, as the two regimes offer complementary trade-offs: decentralized data is abundant but subject to heterogeneity and communication constraints, while centralized data, though limited in volume and potentially unrepresentative, enables better curation and high-throughput access. Despite its potential, effectively combining these paradigms remains challenging, and few frameworks are tailored to hybrid data regimes. To address this, we propose a novel framework that constructs a model atlas from decentralized models and leverages centralized data to refine a global model within this structured space. The refined model is then used to reinitialize the decentralized models. Our method synergizes federated learning (to exploit decentralized data) and model merging (to utilize centralized data), enabling effective training under hybrid data availability. Theoretically, we show that our approach achieves faster convergence than methods relying solely on decentralized data, due to variance reduction in the merging process. Extensive experiments demonstrate that our framework consistently outperforms purely centralized, purely decentralized, and existing hybrid-adaptable methods. Notably, our method remains robust even when the centralized and decentralized data domains differ or when decentralized data contains noise, significantly broadening its applicability.
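A heavily simplified round of the loop might look as follows, with a plain parameter average standing in for the paper's structured model atlas: merge the decentralized models, refine the merged model on centralized data, and broadcast it back as the clients' new initialization.

```python
import copy
import torch

def merge_refine_reinit(client_models, centralized_loader, loss_fn, epochs=1):
    """One hybrid-data round, heavily simplified: a naive parameter average
    stands in for the paper's model atlas construction."""
    global_model = copy.deepcopy(client_models[0])
    with torch.no_grad():
        for name, param in global_model.named_parameters():       # naive merge
            stacked = torch.stack(
                [dict(m.named_parameters())[name] for m in client_models])
            param.copy_(stacked.mean(dim=0))
    opt = torch.optim.SGD(global_model.parameters(), lr=1e-2)
    for _ in range(epochs):                                       # refine on centralized data
        for x, y in centralized_loader:
            opt.zero_grad()
            loss_fn(global_model(x), y).backward()
            opt.step()
    for m in client_models:                                       # reinitialize the clients
        m.load_state_dict(global_model.state_dict())
    return global_model
```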

[422] Curriculum Abductive Learning

Wen-Chao Hu, Qi-Jie Li, Lin-Han Jia, Cunjing Ge, Yu-Feng Li, Yuan Jiang, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: Curriculum Abductive Learning (C-ABL) addresses training instability in Abductive Learning by progressively partitioning the knowledge base into sub-bases, reducing abduction space and enabling stepwise logic incorporation.

DetailsMotivation: Traditional ABL suffers from training instability due to large abduction spaces from complex knowledge bases, treating knowledge as a static black box.

Method: C-ABL partitions the knowledge base into sequential sub-bases introduced progressively during training, reducing abduction space and enabling smooth logic incorporation.

Result: C-ABL outperforms previous ABL implementations, significantly improving training stability, convergence speed, and final accuracy, especially with complex knowledge.

Conclusion: Explicitly leveraging knowledge base structure through curriculum learning effectively addresses ABL training challenges and enhances performance.

Abstract: Abductive Learning (ABL) integrates machine learning with logical reasoning in a loop: a learning model predicts symbolic concept labels from raw inputs, which are revised through abduction using domain knowledge and then fed back for retraining. However, due to the nondeterminism of abduction, the training process often suffers from instability, especially when the knowledge base is large and complex, resulting in a prohibitively large abduction space. While prior works focus on improving candidate selection within this space, they typically treat the knowledge base as a static black box. In this work, we propose Curriculum Abductive Learning (C-ABL), a method that explicitly leverages the internal structure of the knowledge base to address the ABL training challenges. C-ABL partitions the knowledge base into a sequence of sub-bases, progressively introduced during training. This reduces the abduction space throughout training and enables the model to incorporate logic in a stepwise, smooth way. Experiments across multiple tasks show that C-ABL outperforms previous ABL implementations, significantly improves training stability, convergence speed, and final accuracy, especially under complex knowledge setting.

[423] Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng

Main category: cs.LG

TL;DR: LatentSeek enhances LLM reasoning through test-time adaptation in latent space using policy gradient optimization, outperforming traditional methods while being efficient and scalable.

DetailsMotivation: Address limitations of current LLMs in reasoning ability, including catastrophic forgetting from training and limited novel data, by exploring test-time scaling in latent space instead of token space.

Method: LatentSeek framework uses Test-Time Instance-level Adaptation (TTIA) in latent space with policy gradient to iteratively update latent representations guided by self-generated reward signals.

Result: Outperforms strong baselines like Chain-of-Thought and fine-tuning methods on reasoning benchmarks (GSM8K, MATH-500, AIME2024) across multiple LLM architectures, converging quickly while benefiting from additional iterations.

Conclusion: LatentSeek provides a lightweight, scalable, and effective solution for enhancing LLM reasoning capabilities through test-time scaling in latent space, demonstrating the potential of this paradigm.

Abstract: Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model’s latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
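To make the test-time loop concrete: the latent behind the generated answer is treated as the object being optimized, with a non-differentiable self-generated reward. The sketch below uses an antithetic evolution-strategies gradient estimator as a stand-in for the paper's policy-gradient update; `decode` and `reward_fn` are hypothetical.

```python
import torch

def latent_test_time_adapt(z_init, decode, reward_fn, steps=20, lr=0.05, sigma=0.1):
    """Test-time instance-level adaptation in latent space, sketched.

    `decode(z)` maps a latent to an answer and `reward_fn` scores it with a
    self-generated reward. The antithetic evolution-strategies estimator here
    stands in for the paper's policy-gradient update over latent representations.
    """
    z = z_init.clone()
    for _ in range(steps):
        noise = sigma * torch.randn_like(z)
        r_plus = reward_fn(decode(z + noise))        # score perturbed latents
        r_minus = reward_fn(decode(z - noise))
        grad_est = (r_plus - r_minus) / (2 * sigma ** 2) * noise
        z = z + lr * grad_est                        # ascend the self-reward
    return decode(z)
```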

[424] Learning to Insert for Constructive Neural Vehicle Routing Solver

Fu Luo, Xi Lin, Mengyuan Zhong, Fei Liu, Zhenkun Wang, Jianyong Sun, Qingfu Zhang

Main category: cs.LG

TL;DR: L2C-Insert is a novel neural combinatorial optimization method that uses insertion-based paradigm instead of traditional appending to solve vehicle routing problems, achieving superior performance.

DetailsMotivation: Existing constructive NCO methods use rigid appending-based approaches that lead to suboptimal results, motivating the exploration of more flexible insertion-based paradigms.

Method: Proposes L2C-Insert with three key components: novel model architecture for insertion position prediction, efficient training scheme, and advanced inference technique that exploits insertion flexibility.

Result: Extensive experiments on TSP and CVRP show L2C-Insert consistently achieves superior performance across various problem sizes on both synthetic and real-world instances.

Conclusion: The insertion-based paradigm significantly enhances flexibility and solution quality in neural combinatorial optimization for vehicle routing problems.

Abstract: Neural Combinatorial Optimisation (NCO) is a promising learning-based approach for solving Vehicle Routing Problems (VRPs) without extensive manual design. While existing constructive NCO methods typically follow an appending-based paradigm that sequentially adds unvisited nodes to partial solutions, this rigid approach often leads to suboptimal results. To overcome this limitation, we explore the idea of insertion-based paradigm and propose Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel learning-based method for constructive NCO. Unlike traditional approaches, L2C-Insert builds solutions by strategically inserting unvisited nodes at any valid position in the current partial solution, which can significantly enhance the flexibility and solution quality. The proposed framework introduces three key components: a novel model architecture for precise insertion position prediction, an efficient training scheme for model optimization, and an advanced inference technique that fully exploits the insertion paradigm’s flexibility. Extensive experiments on both synthetic and real-world instances of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior performance across various problem sizes.
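The mechanics of insertion-based construction are easy to state: at each step, score every valid position in the partial tour and insert the new node there. In the sketch below, greedy cheapest insertion stands in for the learned position-scoring model.

```python
import numpy as np

def insertion_cost(tour, node, pos, dist):
    """Tour length change from inserting `node` between tour[pos] and its successor."""
    a, b = tour[pos], tour[(pos + 1) % len(tour)]
    return dist[a, node] + dist[node, b] - dist[a, b]

def construct_by_insertion(dist, score_fn):
    """Insertion-based construction for TSP.

    `score_fn(tour, node, pos, dist)` stands in for the learned model that
    scores every valid insertion position; lower is better here.
    """
    n = len(dist)
    tour = [0, 1]                                    # seed partial tour
    for node in range(2, n):
        scores = [score_fn(tour, node, pos, dist) for pos in range(len(tour))]
        tour.insert(int(np.argmin(scores)) + 1, node)
    return tour

# Placeholder policy: greedy cheapest insertion on random points.
pts = np.random.rand(10, 2)
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
print(construct_by_insertion(dist, insertion_cost))
```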

[425] Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder, Deep Karkhanis

Main category: cs.LG

TL;DR: PKPO is a reinforcement learning method that optimizes pass@k performance by transforming rewards to prioritize joint utility of sample sets rather than individual samples, enabling better exploration and solving harder problems.

DetailsMotivation: Traditional RL optimizes pass@1 performance by rewarding individual samples independently, which underutilizes sampling capacity and limits exploration on harder examples by not considering the collective utility of sample sets.

Method: Proposed Pass-at-k Policy Optimization (PKPO) with novel low-variance unbiased estimators for pass@k and its gradient in both binary and continuous reward settings. The method transforms final rewards to optimize for sets of samples that maximize reward when considered jointly, and allows annealing k during training.

Result: PKPO effectively optimizes for target k, with higher k values enabling solving more and harder problems. Annealing k boosts both pass@1 and pass@k performance. On challenging tasks where conventional pass@1 optimization stalls, PKPO unblocks learning through better exploration.

Conclusion: PKPO enables robust optimization of pass@k for any k ≤ n, overcoming limitations of traditional RL that prioritize individual sample strength over diversity and collective utility of sample sets.

Abstract: Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
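The pass@k quantity itself has a standard unbiased estimator from n samples with c successes, shown below; PKPO's contribution is the low-variance estimator for its gradient, realized as a joint transformation of per-sample rewards, which this sketch does not reproduce.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from n attempts with c correct: 1 - C(n-c,k)/C(n,k)."""
    if n - c < k:
        return 1.0      # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 8 samples of which 2 are correct:
print(pass_at_k(8, 2, 1))   # 0.25
print(pass_at_k(8, 2, 4))   # ~0.786
```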

[426] Efficient Regression-Based Training of Normalizing Flows for Boltzmann Generators

Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose

Main category: cs.LG

TL;DR: RegFlow is a novel regression-based training method for normalizing flows that replaces unstable maximum likelihood training with a simple L2-regression objective, enabling more stable and efficient training for Boltzmann Generators in molecular systems.

DetailsMotivation: Modern generative models like diffusion models have expensive inference, while classical normalizing flows for Boltzmann Generators suffer from unstable maximum likelihood training. There's a need for stable, efficient training methods for molecular conformation generation.

Method: RegFlow uses regression training where prior samples are mapped to targets computed via optimal transport couplings or pre-trained continuous normalizing flows. It employs regularization strategies including a forward-backward self-consistency loss for numerical stability.

Result: RegFlow enables training of previously intractable architectures for Boltzmann Generators and outperforms maximum likelihood training in performance, computational cost, and stability for equilibrium sampling of peptides (alanine dipeptide, tripeptide, tetrapeptide).

Conclusion: RegFlow provides a scalable and stable alternative to maximum likelihood training for normalizing flows, demonstrating strong potential for molecular system applications where fast likelihood evaluation is crucial.

Abstract: Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to large-scale diffusion and flow matching models. However, such modern generative models suffer from expensive inference, inhibiting their use in numerous scientific applications like Boltzmann Generators (BGs) for molecular conformations that require fast likelihood evaluation. In this paper, we revisit classical normalizing flows in the context of BGs that offer efficient sampling and likelihoods, but whose training via maximum likelihood is often unstable and computationally challenging. We propose Regression Training of Normalizing Flows (RegFlow), a novel and scalable regression-based training objective that bypasses the numerical instability and computational challenge of conventional maximum likelihood training in favour of a simple $\ell_2$-regression objective. Specifically, RegFlow maps prior samples under our flow to targets computed using optimal transport couplings or a pre-trained continuous normalizing flow (CNF). To enhance numerical stability, RegFlow employs effective regularization strategies such as a new forward-backward self-consistency loss that enjoys painless implementation. Empirically, we demonstrate that RegFlow unlocks a broader class of architectures that were previously intractable to train for BGs with maximum likelihood. We also show RegFlow exceeds the performance, computational cost, and stability of maximum likelihood training in equilibrium sampling in Cartesian coordinates of alanine dipeptide, tripeptide, and tetrapeptide, showcasing its potential in molecular systems.
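The core objective reduces to supervised regression once prior samples are matched to targets. Below is a hedged single-step sketch, using a Hungarian assignment on squared distances as the minibatch OT coupling and omitting the forward-backward self-consistency loss.

```python
import torch
from scipy.optimize import linear_sum_assignment

def regflow_step(flow, prior_batch, target_batch, opt):
    """One RegFlow-style training step, sketched.

    Prior samples are matched to targets with a minibatch optimal-transport
    coupling (Hungarian assignment on squared distances), and the flow is fit
    by plain l2 regression. The paper's regularizers are omitted.
    """
    with torch.no_grad():
        cost = torch.cdist(prior_batch, target_batch) ** 2   # pairwise sq. distances
        row, col = linear_sum_assignment(cost.cpu().numpy()) # minibatch OT coupling
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    opt.zero_grad()
    loss = ((flow(prior_batch[row]) - target_batch[col]) ** 2).mean()  # l2 regression
    loss.backward()
    opt.step()
    return loss.item()
```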

[427] Incentivizing LLMs to Self-Verify Their Answers

Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An

Main category: cs.LG

TL;DR: A framework that enables LLMs to self-verify their answers through unified reinforcement learning, eliminating the need for external reward models and enabling effective test-time scaling.

DetailsMotivation: Limited improvement from post-training on specific reasoning tasks due to distribution discrepancies between specialized generators and general reward models.

Method: Unified RL framework that trains models to both generate answers and verify their correctness within a single process, enabling self-verification without external tools.

Result: Models trained on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B show improved post-training performance and effective test-time scaling across multiple mathematical reasoning benchmarks.

Conclusion: Self-verification through unified RL training enables LLMs to effectively scale performance at inference time without requiring external verifiers.

Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling.

[428] Hysteresis Activation Function for Efficient Inference

Moshe Kimhi, Idan Kashani, Avi Mendelson, Chaim Baskin

Main category: cs.LG

TL;DR: HeLU is a hardware-efficient activation function that solves the dying ReLU problem using variable thresholds during backpropagation, achieving competitive performance without added complexity.

DetailsMotivation: ReLU suffers from the 'dying ReLU' problem where neurons stop activating during training, and existing solutions introduce hardware-inefficient complexity.

Method: Proposes Hysteresis Rectified Linear Unit (HeLU) with variable thresholds that refine backpropagation, maintaining hardware efficiency while preventing neuron death.

Result: Empirical evaluations show HeLU enhances model generalization across diverse datasets and achieves performance comparable to more complex activation functions.

Conclusion: HeLU offers an efficient solution for neural network inference that addresses the dying ReLU problem without introducing unnecessary complexity or hardware inefficiency.

Abstract: The widely used ReLU is favored for its hardware efficiency, as its implementation at inference reduces to a one-bit sign test, yet suffers from issues such as the “dying ReLU” problem, where during training, neurons fail to activate and constantly remain at zero, as highlighted by Lu et al. Traditional approaches to mitigate this issue often introduce more complex and less hardware-friendly activation functions. In this work, we propose a Hysteresis Rectified Linear Unit (HeLU), an efficient activation function designed to address the “dying ReLU” problem with minimal complexity. Unlike traditional activation functions with fixed thresholds for training and inference, HeLU employs a variable threshold that refines the backpropagation. This refined mechanism allows simpler activation functions to achieve competitive performance comparable to their more complex counterparts without introducing unnecessary complexity or requiring inductive biases. Empirical evaluations demonstrate that HeLU enhances model generalization across diverse datasets, offering a promising solution for efficient and effective inference suitable for a wide range of neural network architectures.
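One plausible reading of the mechanism, sketched as a custom autograd function: the forward pass stays a plain ReLU (the one-bit sign test), while the backward pass applies a shifted threshold so gradients can still reach slightly negative pre-activations. The threshold value below is an illustrative guess, not the paper's setting.

```python
import torch

class HeLUFunction(torch.autograd.Function):
    """Hysteresis ReLU, as we read the paper: standard ReLU forward,
    gradient passed through at a shifted threshold on the backward pass."""
    @staticmethod
    def forward(ctx, x, backward_threshold=-0.1):
        ctx.save_for_backward(x)
        ctx.thresh = backward_threshold
        return torch.relu(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # hysteresis: gradient flows wherever x exceeds the *backward* threshold
        return grad_out * (x > ctx.thresh).to(grad_out.dtype), None

x = torch.randn(5, requires_grad=True)
HeLUFunction.apply(x, -0.1).sum().backward()
print(x.grad)  # nonzero even for slightly negative inputs
```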

[429] TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields

Alan Arazi, Eilam Shapira, Roi Reichart

Main category: cs.LG

TL;DR: TabSTAR is a tabular foundation model that introduces semantically target-aware representations, achieving state-of-the-art performance on tabular learning tasks with text features by leveraging pretrained text encoders and target tokens for task-specific embeddings.

DetailsMotivation: Deep learning has historically underperformed on tabular learning tasks compared to gradient boosting decision trees, and existing methods using language models for tabular tasks rely on static, target-agnostic textual representations that limit effectiveness.

Method: TabSTAR unfreezes a pretrained text encoder and uses target tokens as input to provide context for learning task-specific embeddings, with an architecture free of dataset-specific parameters to enable transfer learning on tabular data with textual features.

Result: TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets.

Conclusion: TabSTAR offers a pathway for further performance improvements in tabular foundation models by enabling effective transfer learning and leveraging semantically target-aware representations.

Abstract: While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees. However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Tabular Foundation Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.

[430] Time Weaver: A Conditional Time Series Generation Model

Sai Shankar Narasimhan, Shubhankar Agarwal, Oguzhan Akcin, Sujay Sanghavi, Sandeep Chinchali

Main category: cs.LG

TL;DR: TIME WEAVER is a diffusion-based model for generating time series data using heterogeneous contextual metadata (categorical, continuous, time-variant variables) and introduces a novel evaluation metric for conditional time series generation.

DetailsMotivation: Current time series generation approaches ignore paired heterogeneous contextual metadata (like weather, location, EV presence), which is crucial for real-world applications like electricity demand forecasting during winter freezes.

Method: Developed TIME WEAVER, a diffusion-based model that leverages heterogeneous metadata including categorical, continuous, and time-variant variables to improve time series generation quality.

Result: TIME WEAVER outperforms state-of-the-art benchmarks (GANs) by up to 30% in downstream classification tasks across real-world energy, medical, air quality, and traffic datasets.

Conclusion: The proposed model successfully addresses the gap in conditional time series generation and the introduced evaluation metric better captures specificity and realism compared to naive adaptations from image domain metrics.

Abstract: Imagine generating a city’s electricity demand pattern based on weather, the presence of an electric vehicle, and location, which could be used for capacity planning during a winter freeze. Such real-world time series are often enriched with paired heterogeneous contextual metadata (e.g., weather and location). Current approaches to time series generation often ignore this paired metadata. Additionally, the heterogeneity in metadata poses several practical challenges in adapting existing conditional generation approaches from the image, audio, and video domains to the time series domain. To address this gap, we introduce TIME WEAVER, a novel diffusion-based model that leverages the heterogeneous metadata in the form of categorical, continuous, and even time-variant variables to significantly improve time series generation. Additionally, we show that naive extensions of standard evaluation metrics from the image to the time series domain are insufficient. These metrics do not penalize conditional generation approaches for their poor specificity in reproducing the metadata-specific features in the generated time series. Thus, we innovate a novel evaluation metric that accurately captures the specificity of conditional generation and the realism of the generated time series. We show that TIME WEAVER outperforms state-of-the-art benchmarks, such as Generative Adversarial Networks (GANs), by up to 30% in downstream classification tasks on real-world energy, medical, air quality, and traffic datasets.

[431] Parallel Unlearning in Inherited Model Networks

Xiao Liu, Mingyuan Li, Guangsheng Yu, Lixiang Li, Haipeng Peng, Ren Ping Liu

Main category: cs.LG

TL;DR: A novel parallel unlearning framework for models with inheritance relationships using Fisher Information Matrix to enable efficient one-shot knowledge removal while maintaining model performance.

DetailsMotivation: Unlearning is challenging in learning frameworks with continuous model growth and complex inheritance relationships, requiring efficient methods to remove inherited knowledge.

Method: Uses chronologically Directed Acyclic Graph (DAG) to capture unlearning scenarios, Fisher Inheritance Unlearning (FIUn) method with Fisher Information Matrix to assess parameter significance, and Merging-FIM (MFIM) function to consolidate multiple FIMs for parallel processing.

Result: Achieves 0% accuracy for unlearned labels and 94.53% for retained labels in single-class tasks; 1.07% for unlearned and 84.77% for retained in multi-class tasks; accelerates unlearning by 99% compared to alternative methods.

Conclusion: The framework enables efficient parallel unlearning in inherited model networks, supporting all DAG-captured scenarios with significant computational overhead reduction while maintaining model performance.

Abstract: Unlearning is challenging in generic learning frameworks with the continuous growth and updates of models exhibiting complex inheritance relationships. This paper presents a novel unlearning framework that enables fully parallel unlearning among models exhibiting inheritance. We use a chronologically Directed Acyclic Graph (DAG) to capture various unlearning scenarios occurring in model inheritance networks. Central to our framework is the Fisher Inheritance Unlearning (FIUn) method, designed to enable efficient parallel unlearning within the DAG. FIUn utilizes the Fisher Information Matrix (FIM) to assess the significance of model parameters for unlearning tasks and adjusts them accordingly. To handle multiple unlearning requests simultaneously, we propose the Merging-FIM (MFIM) function, which consolidates FIMs from multiple upstream models into a unified matrix. This design supports all unlearning scenarios captured by the DAG, enabling one-shot removal of inherited knowledge while significantly reducing computational overhead. Experiments confirm the effectiveness of our unlearning framework. For single-class tasks, it achieves complete unlearning with 0% accuracy for unlearned labels while maintaining 94.53% accuracy for retained labels. For multi-class tasks, the accuracy is 1.07% for unlearned labels and 84.77% for retained labels. Our framework accelerates unlearning by 99% compared to alternative methods. Code is in https://github.com/MJLee00/Parallel-Unlearning-in-Inherited-Model-Networks.
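A sketch of the Fisher-based ingredient, under simplifying assumptions: estimate a diagonal FIM from squared gradients on the forget set, then adjust the parameters it flags as important. The zeroing rule below is illustrative; the paper's exact update and the MFIM consolidation across upstream models are not reproduced.

```python
import torch

def diagonal_fisher(model, loader, loss_fn):
    """Diagonal Fisher Information estimate: mean squared gradient per parameter."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    count = 0
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
        count += 1
    return {n: f / count for n, f in fisher.items()}

def fisher_unlearn(model, fisher_forget, threshold):
    """FIUn-style one-shot adjustment, sketched: dampen parameters whose
    Fisher score on the forget set is high (zeroing is an illustrative rule)."""
    with torch.no_grad():
        for n, p in model.named_parameters():
            mask = fisher_forget[n] > threshold      # parameters important to the forget task
            p[mask] = 0.0
```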

[432] Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun

Main category: cs.LG

TL;DR: Schedule-Free (SF) method is a scalable alternative to conventional pretraining strategies that avoids explicit decay phases and weight averaging overhead while maintaining strong performance.

DetailsMotivation: Conventional pretraining strategies with fixed compute budgets are inadequate for large-scale training, and existing alternatives like WSD and weight averaging have limitations in flexibility and memory usage.

Method: Revisits the Schedule-Free (SF) method, analyzes its dynamics theoretically and empirically, and proposes a refined variant that improves robustness to momentum and large batch sizes.

Result: SF-AdamW effectively navigates loss landscape without decay phases or auxiliary averaging, and the refined variant addresses key limitations of the original method.

Conclusion: SF is established as a practical, scalable, and theoretically grounded approach for language model training.

Abstract: As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets-such as cosine learning rate schedules-are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the “river” structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
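For reference, the Schedule-Free iteration (following Defazio et al., 2024, in its simplest SGD form) keeps two sequences: a base iterate z and a running average x, with gradients evaluated at their interpolation. Scalars stand in for parameter tensors in this sketch.

```python
def schedule_free_sgd_step(x, z, grad_fn, t, lr=0.1, beta=0.9):
    """One Schedule-Free SGD step, simplified from Defazio et al. (2024).

    x is the running average (the weights you evaluate), z the base iterate.
    No decay schedule and no extra averaging buffer: x *is* the average.
    """
    y = (1 - beta) * z + beta * x        # gradient point interpolates x and z
    g = grad_fn(y)
    z = z - lr * g                       # base SGD step on z
    c = 1.0 / (t + 1)                    # uniform averaging weight
    x = (1 - c) * x + c * z              # implicit weight averaging, no extra memory
    return x, z

# Toy quadratic f(w) = 0.5 * w^2, gradient w:
x = z = 5.0
for t in range(100):
    x, z = schedule_free_sgd_step(x, z, grad_fn=lambda w: w, t=t)
print(x)  # approaches the minimizer 0
```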

[433] Dolphin: A Programmable Framework for Scalable Neurosymbolic Learning

Aaditya Naik, Jason Liu, Claire Wang, Amish Sethi, Saikat Dutta, Mayur Naik, Eric Wong

Main category: cs.LG

TL;DR: DOLPHIN is a neurosymbolic framework that integrates symbolic reasoning with deep learning, achieving state-of-the-art efficiency and convergence on complex benchmarks where existing frameworks fail.

DetailsMotivation: To address the scalability challenges in neurosymbolic learning when dealing with complex symbolic programs and large datasets.

Method: DOLPHIN supports neurosymbolic programs in Python, executing symbolic reasoning on CPU while vectorizing probabilistic computations and gradient propagation on GPU.

Result: Across 13 benchmarks spanning text, image, and video data, DOLPHIN converges to state-of-the-art accuracies on complex benchmarks while being 1.71x to 62x faster than existing frameworks like Scallop, ISED, and IndeCateR+.

Conclusion: DOLPHIN advances the scalability of neurosymbolic frameworks, achieving superior efficiency and convergence on difficult benchmarks where existing frameworks struggle.

Abstract: Neurosymbolic learning enables the integration of symbolic reasoning with deep learning but faces significant challenges in scaling to complex symbolic programs, large datasets, or both. We introduce DOLPHIN, a framework that tackles these challenges by supporting neurosymbolic programs in Python, executing complex symbolic reasoning on the CPU while vectorizing probabilistic computations and gradient propagation on the GPU. Across 13 benchmarks spanning tasks over text, image, and video data, with symbolic reasoning features like recursion and black-box functions, DOLPHIN converges to state-of-the-art accuracies on the more complex benchmarks while existing frameworks such as Scallop, ISED, and IndeCateR+ fail to converge within the time limit. On simpler benchmarks, DOLPHIN matches their performance, while achieving these results 1.71x to 62x faster than the baselines. Overall, DOLPHIN advances the scalability of neurosymbolic frameworks, achieving state-of-the-art efficiency and convergence on difficult benchmarks where existing frameworks struggle. The code is published at https://github.com/Dolphin-NeSy/Dolphin.
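The GPU-vectorization idea can be illustrated on the classic MNIST-sum task: given batched digit distributions from a neural classifier, the symbolic program `sum = d1 + d2` becomes one batched outer product plus a scatter-add. DOLPHIN's actual Python programming interface is far richer than this sketch.

```python
import torch

def prob_digit_sum(p1, p2):
    """Vectorized probabilistic execution of the symbolic program sum = d1 + d2.

    p1, p2: (batch, 10) digit distributions. The joint is a batched outer
    product; a scatter-add aggregates probability mass per sum value,
    entirely on the GPU if the inputs live there.
    """
    batch = p1.shape[0]
    joint = (p1.unsqueeze(2) * p2.unsqueeze(1)).flatten(1)       # (batch, 100)
    digits = torch.arange(10, device=p1.device)
    idx = (digits.unsqueeze(1) + digits).flatten()               # sum values 0..18
    out = torch.zeros(batch, 19, device=p1.device)
    out.scatter_add_(1, idx.unsqueeze(0).repeat(batch, 1), joint)
    return out                                                   # (batch, 19)

p1 = torch.softmax(torch.randn(4, 10), dim=1)
p2 = torch.softmax(torch.randn(4, 10), dim=1)
print(prob_digit_sum(p1, p2).sum(dim=1))  # each row sums to 1
```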

[434] Similarity-Distance-Magnitude Activations

Allen Schmaltz

Main category: cs.LG

TL;DR: Introduces SDM activation function and estimator for robust selective classification, improving on softmax with similarity and distance awareness.

DetailsMotivation: To create a more robust and interpretable alternative to softmax activation that handles co-variate shifts and out-of-distribution inputs better.

Method: Proposes SDM activation function with similarity, distance, and magnitude awareness, and SDM estimator for selective classification using empirical CDF partitioning.

Result: SDM estimator outperforms existing calibration methods with softmax activations on co-variate shifts and OOD inputs while maintaining in-distribution performance.

Conclusion: SDM provides a more robust and interpretable framework for selective classification that handles distribution shifts effectively.

Abstract: We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to co-variate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.
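
The abstract's notion of partitioning class-wise empirical CDFs can be illustrated with a deliberately simplified stand-in (this is not the SDM activation or estimator; the quantile rule and the -1 abstention encoding are assumptions made for the sketch):

```python
import numpy as np

def selective_predict(probs_test, probs_cal, labels_cal, reject_frac=0.1):
    """Toy selective classification via class-wise empirical CDFs: a test
    prediction is kept only if its confidence clears the `reject_frac`
    quantile of calibration confidences for the predicted class."""
    n_classes = probs_test.shape[1]
    thresholds = np.full(n_classes, np.inf)
    for c in range(n_classes):
        mask = labels_cal == c
        if mask.any():
            # Empirical CDF of confidences among calibration points of class c.
            thresholds[c] = np.quantile(probs_cal[mask].max(axis=1), reject_frac)
    preds = probs_test.argmax(axis=1)
    keep = probs_test.max(axis=1) >= thresholds[preds]
    return np.where(keep, preds, -1)    # -1 encodes abstention
```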

[435] Differentiation Through Black-Box Quadratic Programming Solvers

Connor W. Magoon, Fengyu Yang, Noam Aigerman, Shahar Z. Kovalsky

Main category: cs.LG

TL;DR: dQP is a modular, solver-agnostic framework for differentiating quadratic programming solutions that decouples computation from differentiation using active set information.

DetailsMotivation: Existing differentiable QP approaches rely on integrated solvers, limiting applicability in neural networks and bi-level optimization by restricting solver choices.

Method: Leverages active set information to express solution and derivatives using simplified linear systems with shared matrix, fully decoupling QP solution computation from differentiation.

Result: Open-source implementation integrates with 15+ state-of-the-art solvers, showing robustness and scalability especially in large-scale sparse problems.

Conclusion: dQP provides a plug-and-play differentiation framework that overcomes solver dependency limitations while maintaining efficiency and broad compatibility.

Abstract: Differentiable optimization has attracted significant research interest, particularly for quadratic programming (QP). Existing approaches for differentiating the solution of a QP with respect to its defining parameters often rely on specific integrated solvers. This integration limits their applicability, including their use in neural network architectures and bi-level optimization tasks, restricting users to a narrow selection of solver choices. To address this limitation, we introduce dQP, a modular and solver-agnostic framework for plug-and-play differentiation of virtually any QP solver. A key insight we leverage to achieve modularity is that, once the active set of inequality constraints is known, both the solution and its derivative can be expressed using simplified linear systems that share the same matrix. This formulation fully decouples the computation of the QP solution from its differentiation. Building on this result, we provide a minimal-overhead, open-source implementation (https://github.com/cwmagoon/dQP) that seamlessly integrates with over 15 state-of-the-art solvers. Comprehensive benchmark experiments demonstrate dQP’s robustness and scalability, particularly highlighting its advantages in large-scale sparse problems.
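
The decoupling insight is easy to see in code. The sketch below (a toy assuming a non-degenerate active set; not the dQP package interface) takes a solution x* from any black-box solver, recovers the active set, and obtains both the multipliers and the Jacobian dx*/dq from linear systems that share a single KKT matrix:

```python
import numpy as np

def qp_solution_and_jacobian(Q, q, A, b, x_star, tol=1e-8):
    """For  min 0.5 x^T Q x + q^T x  s.t.  A x <= b,  given a solution
    x_star from any solver: detect active constraints, then solve two
    systems with the same KKT matrix K for (x, lambda) and dx*/dq."""
    active = A @ x_star >= b - tol            # active inequality constraints
    A_act = A[active]
    n, m = Q.shape[0], A_act.shape[0]
    K = np.block([[Q, A_act.T],
                  [A_act, np.zeros((m, m))]])
    # Primal-dual solution of the reduced equality-constrained QP.
    sol = np.linalg.solve(K, np.concatenate([-q, b[active]]))
    # Differentiate the KKT system w.r.t. q:  K @ d[x; lam]/dq = [-I; 0].
    rhs = np.vstack([-np.eye(n), np.zeros((m, n))])
    J = np.linalg.solve(K, rhs)[:n]           # dx*/dq, shape (n, n)
    return sol[:n], J

# Tiny check: min 0.5||x||^2 + q^T x  s.t. x <= 0, with q = (-1, 1).
Q, q = np.eye(2), np.array([-1.0, 1.0])
A, b = np.eye(2), np.zeros(2)
x = np.array([0.0, -1.0])                     # solution from any solver
x_rec, J = qp_solution_and_jacobian(Q, q, A, b, x)
# J == [[0, 0], [0, -1]]: x1 is pinned at 0, x2 tracks -q2.
```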

[436] SignalLLM: A General-Purpose LLM Agent Framework for Automated Signal Processing

Junlong Ke, Qiying Hu, Shenghai Yuan, Yuecong Xu, Jianfei Yang

Main category: cs.LG

TL;DR: SignalLLM is the first general-purpose LLM-based agent framework for signal processing tasks, using modular architecture to decompose goals into subtasks through in-context learning and hierarchical planning, outperforming traditional methods especially in few-shot settings.

DetailsMotivation: Traditional signal processing pipelines are constrained by complex workflows, heavy reliance on expert knowledge, and poor adaptability with limited data, while LLMs offer strong reasoning, general knowledge, and cross-modal transfer abilities.

Method: SignalLLM uses a principled modular architecture that decomposes SP goals into structured subtasks via in-context learning and domain-specific retrieval, followed by hierarchical planning through adaptive RAG and refinement, executed through prompt-based reasoning, code synthesis, and model invocation.

Result: Experimental results across five representative tasks in communication and sensing (e.g., radar target detection, human activity recognition, and text compression) show superior performance over traditional and existing LLM-based methods, particularly in few-shot and zero-shot settings.

Conclusion: SignalLLM demonstrates the potential of LLMs as powerful tools for automating and generalizing signal processing workflows, offering versatility across different signal modalities, task types, and data conditions.

Abstract: Modern signal processing (SP) pipelines, whether model-based or data-driven, are often constrained by complex and fragmented workflows, rely heavily on expert knowledge and manual engineering, and struggle with adaptability and generalization under limited data. In contrast, Large Language Models (LLMs) offer strong reasoning capabilities, broad general-purpose knowledge, in-context learning, and cross-modal transfer abilities, positioning them as powerful tools for automating and generalizing SP workflows. Motivated by this potential, we introduce SignalLLM, the first general-purpose LLM-based agent framework for SP tasks. Unlike prior LLM-based SP approaches that are limited to narrow applications or rely on brittle prompting, SignalLLM introduces a principled, modular architecture. It decomposes high-level SP goals into structured subtasks via in-context learning and domain-specific retrieval, followed by hierarchical planning through adaptive retrieval-augmented generation (RAG) and refinement; these subtasks are then executed through prompt-based reasoning, cross-modal reasoning, code synthesis, model invocation, or data-driven LLM-assisted modeling. Its generalizable design enables the flexible selection of problem-solving strategies across different signal modalities, task types, and data conditions. We demonstrate the versatility and effectiveness of SignalLLM through five representative tasks in communication and sensing, such as radar target detection, human activity recognition, and text compression. Experimental results show superior performance over traditional and existing LLM-based methods, particularly in few-shot and zero-shot settings.

[437] Stability and Sharper Risk Bounds with Convergence Rate $\tilde{O}(1/n^2)$

Bowei Zhu, Shaojie Li, Mingyang Yi, Yong Liu

Main category: cs.LG

TL;DR: This paper establishes improved high-probability excess risk bounds of O(log²(n)/n²) for learners satisfying Polyak-Lojasiewicz condition, smoothness, and Lipschitz continuity, which is tighter than prior O(log(n)/n) bounds.

DetailsMotivation: To improve upon prior work that established O(log(n)/n) excess risk bounds via algorithmic stability for strongly-convex learners, and to provide tighter high-probability bounds for gradient-based generalization gaps in nonconvex settings.

Method: Analysis under common assumptions including Polyak-Lojasiewicz condition, smoothness, and Lipschitz continuity for losses, building on algorithmic stability approaches.

Result: Achieved improved excess risk bounds of O(log²(n)/n²), which are tighter than prior O(log(n)/n) bounds and represent the tightest known high-probability bounds for gradient-based generalization gaps in nonconvex settings.

Conclusion: The analysis demonstrates that excess risk rates of O(log²(n)/n²) are achievable under the stated assumptions, providing a significant improvement over prior work and establishing state-of-the-art bounds for nonconvex optimization problems.

Abstract: Prior work (Klochkov & Zhivotovskiy, 2021) establishes at most $O\left(\log (n)/n\right)$ excess risk bounds via algorithmic stability for strongly-convex learners with high probability. We show that under similar common assumptions (the Polyak-Lojasiewicz condition, smoothness, and Lipschitz continuity of the losses) rates of $O\left(\log^2(n)/n^2\right)$ are achievable. To our knowledge, our analysis also provides the tightest high-probability bounds for gradient-based generalization gaps in nonconvex settings.

[438] Revealing Multimodal Causality with Large Language Models

Jin Li, Shoujin Wang, Qi Zhang, Feng Liu, Tongliang Liu, Longbing Cao, Shui Yu, Fang Chen

Main category: cs.LG

TL;DR: MLLM-CD is a novel framework for multimodal causal discovery from unstructured data that addresses limitations of current MLLMs in identifying causal variables and handling structural ambiguities through contrastive factor discovery, statistical structure learning, and iterative counterfactual reasoning.

DetailsMotivation: To overcome the limitations of current multimodal LLMs in causal discovery, particularly their difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification and insufficiency in handling structural ambiguities with purely observational data.

Method: Three key components: (1) contrastive factor discovery module to identify genuine multimodal factors from contrastive sample pairs, (2) statistical causal structure discovery module to infer causal relationships, and (3) iterative multimodal counterfactual reasoning module to refine discoveries using MLLMs’ world knowledge and reasoning capabilities.

Result: Extensive experiments on synthetic and real-world datasets demonstrate MLLM-CD’s effectiveness in revealing genuine factors and causal relationships from multimodal unstructured data.

Conclusion: The proposed MLLM-CD framework successfully addresses key challenges in multimodal causal discovery and shows promising results in uncovering cause-and-effect mechanisms from unstructured multimodal data.

Abstract: Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data.

[439] HoGA: Higher-Order Graph Attention via Diversity-Aware k-Hop Sampling

Thomas Bailie, Yun Sing Koh, Karthik Mukkavilli

Main category: cs.LG

TL;DR: HoGA introduces higher-order graph attention by sampling diverse subgraphs to capture complex relationships beyond traditional MPNNs, achieving significant accuracy improvements in node classification.

DetailsMotivation: Traditional edge-based MPNNs have limited expressive power for discovering higher-order relationships in graphs, creating a need for methods that can capture more complex topological patterns.

Method: HoGA constructs k-order attention matrices by sampling diverse subgraphs to maximize feature vector diversity, avoiding redundant higher-order relationships and capturing varied topological modalities.

Result: HoGA achieves at least 5% accuracy gain on all benchmark node classification datasets and outperforms recent baselines on six of eight datasets.

Conclusion: The HoGA module effectively expands the expressive power of graph neural networks by capturing diverse higher-order relationships through strategic subgraph sampling.

Abstract: Graphs model latent variable relationships in many real-world systems, and Message Passing Neural Networks (MPNNs) are widely used to learn such structures for downstream tasks. While edge-based MPNNs effectively capture local interactions, their expressive power is theoretically bounded, limiting the discovery of higher-order relationships. We introduce the Higher-Order Graph Attention (HoGA) module, which constructs a k-order attention matrix by sampling subgraphs to maximize diversity among feature vectors. Unlike existing higher-order attention methods that greedily resample similar k-order relationships, HoGA targets diverse modalities in higher-order topology, reducing redundancy and expanding the range of captured substructures. Applied to two single-hop attention models, HoGA achieves at least a 5% accuracy gain on all benchmark node classification datasets and outperforms recent baselines on six of eight datasets. Code is available at https://github.com/TB862/Higher_Order.
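
A hedged sketch of what diversity-aware k-hop sampling could look like (the BFS neighborhood and the farthest-point selection on feature vectors below are assumptions standing in for the paper's exact sampler):

```python
import numpy as np
from collections import deque

def diverse_khop_sample(adj, feats, node, k, m):
    """Gather the k-hop neighborhood of `node` by BFS, then pick m neighbors
    by farthest-point sampling in feature space, so the sampled set covers
    diverse topological "modalities" rather than redundant lookalikes.
    adj: dict node -> list of neighbors; feats: array of shape (N, d)."""
    seen, frontier, hops = {node}, deque([(node, 0)]), []
    while frontier:
        u, d = frontier.popleft()
        if d == k:
            continue
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                hops.append(v)
                frontier.append((v, d + 1))
    if len(hops) <= m:
        return hops
    chosen = [hops[0]]
    while len(chosen) < m:
        # Greedily add the neighbor farthest (in feature space) from the set.
        dists = [min(np.linalg.norm(feats[v] - feats[c]) for c in chosen)
                 for v in hops]
        chosen.append(hops[int(np.argmax(dists))])
    return chosen
```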

[440] Omni-Mol: Multitask Molecular Model for Any-to-any Modalities

Chengxin Hu, Hao Li, Yihe Yuan, Zezheng Song, Chenyang Zhao, Haixin Wang

Main category: cs.LG

TL;DR: Omni-Mol is a multimodal LLM framework that unifies small-molecule tasks by addressing dataset limitations, task competition, and representation challenges through a novel MoGE architecture, achieving SOTA on 13 out of 16 tasks.

DetailsMotivation: Current multimodal LLMs for molecular tasks fail to achieve truly universal molecular models due to small datasets, task competition causing instability, and difficulty balancing representation dimensions across different molecular task types.

Method: Categorizes molecular tasks into four types (Mol2Mol, Mol2Text, Mol2Num, Text2Mol), collects largest molecular instruction-tuning dataset (1.4M+ samples across 16 tasks), and proposes MoGE architecture that dynamically adapts to task intrinsic ranks.

Result: Achieves state-of-the-art performance on 13 out of 16 unified molecular tasks, demonstrating superior scalability and versatility through extensive experiments.

Conclusion: Omni-Mol successfully addresses key challenges in universal molecular modeling through comprehensive dataset construction and innovative MoGE architecture, enabling effective unified instruction tuning across diverse molecular tasks.

Abstract: In the molecular domain, numerous studies have explored the use of multimodal large language models (LLMs) to construct a general-purpose, multi-task molecular model. However, these efforts are still far from achieving a truly universal molecular model. We identify three key challenges in this endeavor: (1) Existing molecular task datasets are typically small in scale and lack comprehensive domain coverage. (2) Tasks from different molecular subfields are difficult to effectively learn jointly through LLMs due to significant distributional shifts and competition among tasks, which introduces instability in the learning process. (3) Both inter-task and intra-task molecular representations demand different intrinsic dimensions in the language space, making it challenging to balance between redundancy and insufficiency in language model representations. To address these challenges, we innovatively categorize existing small-molecule tasks into four types: Mol2Mol, Mol2Text, Mol2Num, and Text2Mol. We then collect a dataset encompassing over 16 tasks with more than 1.4 million samples, making it the largest molecular instruction-tuning dataset to date. Leveraging the extensive pretraining of LLMs on existing chemical literature, we propose a novel multimodal LLM framework, named Omni-Mol, which unifies all small-molecule tasks and supports both molecular generation and understanding. The core of Omni-Mol is our proposed MoGE, which dynamically adapts to the intrinsic rank of different tasks. This mixture-of-experts architecture enhances the model’s ability to handle diverse tasks and modalities effectively. Our model achieves unified instruction tuning across 16 tasks and attains state-of-the-art performance on 13 of them. Extensive experiments further demonstrate the scalability and versatility of Omni-Mol.

[441] On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration

Yehonathan Refael, Amit Aides, Aviad Barzilai, George Leifman, Genady Beryozkin, Vered Silverman, Bolous Jaber, Tomer Shekel

Main category: cs.LG

TL;DR: A cascaded approach combining open-vocabulary detection with few-shot learning for remote sensing, using FLAME active learning for efficient adaptation.

DetailsMotivation: Open-vocabulary detection models struggle with fine-grained class distinctions in specialized domains like remote sensing, limiting practical applications like illegal fishing monitoring.

Method: Cascaded framework: zero-shot OVD model for high-recall proposals + lightweight few-shot classifier for precision refinement, with FLAME active learning for sample selection.

Result: Consistently surpasses state-of-the-art on RS benchmarks, enables instant adaptation (<1 minute) without full-model fine-tuning.

Conclusion: Establishes practical, resource-efficient framework for adapting foundation models to specific user needs in specialized domains.

Abstract: Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as “fishing boat” and “yacht” since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real time on only a handful of user-annotated examples, drastically reducing the high costs of RS imagery annotation. The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, in under a minute, which is significantly faster than state-of-the-art alternatives. Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
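
The selection step can be sketched with standard tools. Below is a hedged toy of a FLAME-style one-step selection (the density estimator, pool size, and clustering choices are assumptions, not the authors' released code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

def flame_select(probs, feats, budget):
    """Score unlabeled proposals by closeness to the decision boundary
    weighted by local density, then cluster the top candidates and take
    one per cluster for diversity. probs: (N,) binary-classifier
    probabilities; feats: (N, d) proposal embeddings."""
    margin = 1.0 - 2.0 * np.abs(probs - 0.5)        # 1 at p=0.5, 0 at 0/1
    kde = KernelDensity(bandwidth=1.0).fit(feats)
    density = np.exp(kde.score_samples(feats))      # favor dense regions
    score = margin * density
    top = np.argsort(score)[-(5 * budget):]         # uncertain, dense pool
    km = KMeans(n_clusters=budget, n_init=10).fit(feats[top])
    picks = []
    for c in range(budget):
        members = top[km.labels_ == c]
        picks.append(members[np.argmax(score[members])])
    return np.array(picks)
```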

[442] Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance

Yongli Yan, Linglong Dai

Main category: cs.LG

TL;DR: Proposes a unified LSTM-based neural decoder with puncturing-aware embedding for convolutional and Turbo codes that adapts to variable code rates and maintains protocol compatibility.

DetailsMotivation: Existing neural network-based decoding methods struggle with punctured codes, particularly in adapting to variable code rates and meeting protocol compatibility requirements.

Method: Uses LSTM-based neural decoder with puncturing-aware embedding that integrates puncturing patterns directly into the network, plus a balanced bit error rate training strategy for robustness across code lengths, rates, and channels.

Result: Extensive simulations in AWGN and Rayleigh fading channels show the proposed neural decoder outperforms conventional decoding techniques with significant improvements in decoding accuracy and robustness.

Conclusion: The proposed approach successfully addresses challenges with punctured codes by enabling seamless adaptation to different code rates while maintaining protocol compatibility through puncturing-aware embedding and balanced training.

Abstract: Neural network-based decoding methods show promise in enhancing error correction performance but face challenges with punctured codes. In particular, existing methods struggle to adapt to variable code rates or meet protocol compatibility requirements. This paper proposes a unified long short-term memory (LSTM)-based neural decoder for punctured convolutional and Turbo codes to address these challenges. The key component of the proposed LSTM-based neural decoder is puncturing-aware embedding, which integrates puncturing patterns directly into the neural network to enable seamless adaptation to different code rates. Moreover, a balanced bit error rate training strategy is designed to ensure the decoder’s robustness across various code lengths, rates, and channels. In this way, the protocol compatibility requirement can be realized. Extensive simulations in both additive white Gaussian noise (AWGN) and Rayleigh fading channels demonstrate that the proposed neural decoder outperforms conventional decoding techniques, offering significant improvements in decoding accuracy and robustness.
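
Architecturally, a puncturing-aware embedding can be as simple as embedding the binary puncturing pattern and concatenating it with the received soft values. The PyTorch sketch below illustrates the idea; layer sizes and the exact input format are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class PuncturingAwareDecoder(nn.Module):
    """Toy puncturing-aware LSTM decoder: the per-position puncturing mask
    is embedded and concatenated with the received LLRs, so one network can
    serve multiple code rates defined by their puncturing patterns."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mask_embed = nn.Embedding(2, 8)     # 0 = punctured, 1 = sent
        self.lstm = nn.LSTM(1 + 8, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)     # per-position bit logit

    def forward(self, llrs, mask):
        # llrs: (B, L) received LLRs, zeros at punctured positions
        # mask: (B, L) long tensor giving the puncturing pattern
        x = torch.cat([llrs.unsqueeze(-1), self.mask_embed(mask)], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)        # logits for each bit

dec = PuncturingAwareDecoder()
logits = dec(torch.randn(4, 64), torch.randint(0, 2, (4, 64)))
```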

[443] Integrating Genomics into Multimodal EHR Foundation Models

Jonathan Amar, Edward Liu, Alessandra Breschi, Liangliang Zhang, Pouya Kheradpour, Sylvia Li, Lisa Soleymani Lehmann, Alessandro Giulianelli, Matt Edwards, Yugang Jia, David Nola, Raghav Mani, Pankaj Vats, Jesse Tetreault, T. J. Chen, Cory Y. McLean

Main category: cs.LG

TL;DR: A novel EHR foundation model that integrates Polygenic Risk Scores (PRS) with traditional EHR data to create holistic health profiles, demonstrating improved predictive capabilities for conditions like Type 2 Diabetes.

DetailsMotivation: To move beyond traditional EHR-only approaches by incorporating genetic predisposition data (PRS) for more comprehensive health profiling and better disease prediction.

Method: Multimodal framework using All of Us Research Program data, extending generative AI advancements to EHR foundation models, with transfer learning for custom classification tasks.

Result: Demonstrated predictive value for various conditions, particularly Type 2 Diabetes, and showed interplay between PRS and EHR data. The architecture proved versatile and efficient for transfer learning.

Conclusion: This integrated approach enables better disease prediction, proactive health management, risk stratification, and personalized treatment strategies, advancing personalized and equitable healthcare evidence generation.

Abstract: This paper introduces an innovative Electronic Health Record (EHR) foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality, moving beyond traditional EHR-only approaches to build more holistic health profiles. Leveraging the extensive and diverse data from the All of Us (AoU) Research Program, this multimodal framework aims to learn complex relationships between clinical data and genetic predispositions. The methodology extends advancements in generative AI to the EHR foundation model space, enhancing predictive capabilities and interpretability. Evaluation on AoU data demonstrates the model’s predictive value for the onset of various conditions, particularly Type 2 Diabetes (T2D), and illustrates the interplay between PRS and EHR data. The work also explores transfer learning for custom classification tasks, showcasing the architecture’s versatility and efficiency. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies, laying the groundwork for more personalized, equitable, and actionable real-world evidence generation in healthcare.

[444] Experiments with Optimal Model Trees

Sabino Francesco Roselli, Eibe Frank

Main category: cs.LG

TL;DR: This paper investigates globally optimal model trees using mixed-integer linear programming, comparing them against greedy approaches and other methods for classification and regression tasks.

DetailsMotivation: Traditional model trees use greedy algorithms that only find locally optimal splits, potentially leading to overly complex trees with suboptimal accuracy. The authors aim to explore globally optimal model trees for better performance.

Method: Used mixed-integer linear programming formulations to learn optimal trees with linear support vector machines at leaf nodes, tested on benchmark datasets, and compared against greedy model trees, classic decision trees, random forests, and SVMs.

Result: Optimal model trees achieved competitive accuracy with very small trees, and multivariate splits (while less interpretable) showed potential for greater accuracy compared to axis-parallel splits.

Conclusion: Globally optimal model trees can provide competitive accuracy with improved interpretability due to smaller tree sizes, offering a promising alternative to greedy approaches.

Abstract: Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to “classic” decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.

[445] Explainable post-training bias mitigation with distribution-based fairness metrics

Ryan Franks, Alexey Miroshnikov, Konstandinos Kotsiopoulos

Main category: cs.LG

TL;DR: A novel bias mitigation framework using distribution-based fairness constraints for creating demographically blind and explainable ML models across various fairness levels through post-processing.

DetailsMotivation: To efficiently produce fairer machine learning models without retraining, enabling demographic blindness and explainability across a wide range of fairness requirements.

Method: Post-processing framework based on stochastic gradient descent that applies distribution-based fairness constraints, particularly focused on gradient-boosted decision trees, with differentiable and consistent fairness metric estimators.

Result: The framework can be applied to various model types and was empirically tested on multiple datasets, showing comparison with alternative approaches like Bayesian search, optimal transport projection, and direct neural network training.

Conclusion: The proposed framework provides an efficient way to generate fair and explainable models through post-processing, supporting diverse fairness levels without requiring model retraining.

Abstract: We develop a novel bias mitigation framework with distribution-based fairness constraints suitable for producing demographically blind and explainable machine-learning models across a wide range of fairness levels. This is accomplished through post-processing, allowing fairer models to be generated efficiently without retraining the underlying model. Our framework, which is based on stochastic gradient descent, can be applied to a wide range of model types, with a particular emphasis on the post-processing of gradient-boosted decision trees. Additionally, we design a broad family of global fairness metrics, along with differentiable and consistent estimators compatible with our framework, building on previous work. We empirically test our methodology on a variety of datasets and compare it with alternative post-processing approaches, including Bayesian search, optimal transport projection, and direct neural network training.
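
As a toy illustration of post-processing with a fairness penalty via gradient descent (the paper's distribution-based metrics and consistent estimators are more general; the mean-gap penalty and the small score-transformation network here are simplifying assumptions):

```python
import torch
import torch.nn as nn

def postprocess_fair(scores, groups, labels, epochs=300, lam=5.0):
    """Learn a small correction on top of frozen model scores, trading off
    accuracy against a fairness penalty. scores: (N,) raw model scores;
    groups: (N,) 0/1 group membership (both groups assumed present);
    labels: (N,) float 0/1 targets."""
    g = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
    opt = torch.optim.Adam(g.parameters(), lr=1e-2)
    s = scores.unsqueeze(-1)
    for _ in range(epochs):
        opt.zero_grad()
        p = torch.sigmoid(g(s)).squeeze(-1)
        bce = nn.functional.binary_cross_entropy(p, labels)
        # Crude stand-in for a distribution distance between group scores.
        gap = (p[groups == 0].mean() - p[groups == 1].mean()) ** 2
        (bce + lam * gap).backward()
        opt.step()
    with torch.no_grad():
        return torch.sigmoid(g(s)).squeeze(-1)
```

The underlying model is never touched: only the post-hoc transformation g is trained, which is what makes the procedure cheap and retraining-free.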

[446] A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks

Tomas Hrycej, Bernhard Bermeitinger, Massimo Pavone, Götz-Henrik Wiegand, Siegfried Handschuh

Main category: cs.LG

TL;DR: The paper proposes a two-phase optimization algorithm that leverages the transition from non-convex to convex regions in loss functions, using Adam for non-convex regions and Conjugate Gradient for convex regions to improve convergence and accuracy.

DetailsMotivation: Loss functions in machine learning often have non-convex regions, leading to widespread use of non-convex methods like Adam. However, local minima imply convex environments where second-order methods like CG can achieve superlinear convergence.

Method: A two-phase optimization framework that detects the transition point from non-convex to convex regions by observing gradient norm dependence on loss. Uses Adam in non-convex regions and Conjugate Gradient in convex regions.

Result: Computing experiments confirm that this convexity structure is frequent enough to be practically exploited, leading to substantial improvements in convergence and accuracy.

Conclusion: The proposed hybrid approach effectively leverages the natural transition from non-convex to convex regions in loss functions, providing a practical optimization strategy that combines the strengths of both Adam and Conjugate Gradient methods.

Abstract: The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the dependence of the gradient norm on the loss. In the non-convex and convex regions, the non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
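
A hedged sketch of the two-phase idea (the paper's swap-point detection is more careful; the log-log slope test below assumes the optimal loss is near zero and is only a stand-in, since for a convex quadratic bowl ||grad|| scales like sqrt(loss)):

```python
import numpy as np
from scipy.optimize import minimize

def two_phase_minimize(f, grad, x0, adam_steps=2000, lr=1e-2, window=50):
    """Phase 1: Adam, while tracking how the gradient norm scales with the
    loss. Once log||g|| vs. log f shows the slope ~0.5 of a locally convex
    quadratic, hand off to conjugate gradient for fast local convergence."""
    m, v = np.zeros_like(x0), np.zeros_like(x0)
    x, hist = x0.copy(), []
    for t in range(1, adam_steps + 1):
        g = grad(x)
        m = 0.9 * m + 0.1 * g
        v = 0.999 * v + 0.001 * g * g
        x -= lr * (m / (1 - 0.9**t)) / (np.sqrt(v / (1 - 0.999**t)) + 1e-8)
        hist.append((np.log(max(f(x), 1e-12)),
                     np.log(np.linalg.norm(g) + 1e-12)))
        if len(hist) >= window:
            logf, logg = map(np.array, zip(*hist[-window:]))
            slope = np.polyfit(logf, logg, 1)[0]
            if abs(slope - 0.5) < 0.1:       # convex-region signature
                break
    return minimize(f, x, jac=grad, method='CG').x

# Toy usage on a quadratic, where the swap is detected almost immediately.
x = two_phase_minimize(lambda w: 0.5 * w @ w, lambda w: w, np.ones(10))
```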

[447] Smart Exploration in Reinforcement Learning using Bounded Uncertainty Models

J. S. van Hulst, W. P. M. H. Heemels, D. J. Antunes

Main category: cs.LG

TL;DR: BUMEX is a model-based RL method that uses prior model knowledge to guide exploration, accelerating learning by optimizing over model sets to bound Q-functions and providing theoretical convergence guarantees.

DetailsMotivation: RL typically requires large amounts of data for optimal policy learning, so the paper aims to incorporate prior model knowledge to reduce data requirements and accelerate the learning process.

Method: The method assumes access to a model set containing true transition kernel and reward function, optimizes over this set to obtain Q-function bounds, uses these bounds to guide exploration, and introduces a regularized version for convergence to optimal policy.

Result: Theoretical convergence guarantees are provided, and when using the BMDP framework, the optimization becomes convex with finite-time convergence. Simulation studies show significant learning acceleration in benchmark examples.

Conclusion: BUMEX effectively accelerates RL learning by leveraging model uncertainty bounds for exploration guidance, with strong theoretical guarantees and practical implementation advantages in structured model sets.

Abstract: Reinforcement learning (RL) is a powerful framework for decision-making in uncertain environments, but it often requires large amounts of data to learn an optimal policy. We address this challenge by incorporating prior model knowledge to guide exploration and accelerate the learning process. Specifically, we assume access to a model set that contains the true transition kernel and reward function. We optimize over this model set to obtain upper and lower bounds on the Q-function, which are then used to guide the exploration of the agent. We provide theoretical guarantees on the convergence of the Q-function to the optimal Q-function under the proposed class of exploring policies. Furthermore, we also introduce a data-driven regularized version of the model set optimization problem that ensures the convergence of the class of exploring policies to the optimal policy. Lastly, we show that when the model set has a specific structure, namely the bounded-parameter MDP (BMDP) framework, the regularized model set optimization problem becomes convex and simple to implement. In this setting, we also prove finite-time convergence to the optimal policy under mild assumptions. We demonstrate the effectiveness of the proposed exploration strategy, which we call BUMEX (Bounded Uncertainty Model-based Exploration), in a simulation study. The results indicate that the proposed method can significantly accelerate learning in benchmark examples. A toolbox is available at https://github.com/JvHulst/BUMEX.
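
To illustrate the bound-guided idea in the bounded-parameter MDP (BMDP) setting, here is a toy interval value iteration (a sketch of the general principle, not the BUMEX toolbox): transition probabilities are only known up to a box, and optimistic/pessimistic Bellman backups yield upper and lower Q-bounds that can steer exploration toward actions with the highest upper bound.

```python
import numpy as np

def _extreme_expectation(lo, hi, V, best=True):
    """Optimize p.V over the box lo <= p <= hi with sum(p) = 1 by greedily
    pushing the free probability mass toward high-V (or low-V) states."""
    p, budget = lo.copy(), 1.0 - lo.sum()
    for i in np.argsort(-V if best else V):
        add = min(hi[i] - lo[i], budget)
        p[i] += add
        budget -= add
    return p @ V

def interval_q_bounds(P_lo, P_hi, R, gamma=0.9, iters=200):
    """Upper/lower Q-function bounds for a BMDP with known rewards.
    P_lo, P_hi: (S, A, S) interval bounds on transitions; R: (S, A)."""
    S, A, _ = P_lo.shape
    Q_hi, Q_lo = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(iters):
        V_hi, V_lo = Q_hi.max(axis=1), Q_lo.max(axis=1)
        for s in range(S):
            for a in range(A):
                Q_hi[s, a] = R[s, a] + gamma * _extreme_expectation(
                    P_lo[s, a], P_hi[s, a], V_hi, best=True)
                Q_lo[s, a] = R[s, a] + gamma * _extreme_expectation(
                    P_lo[s, a], P_hi[s, a], V_lo, best=False)
    return Q_lo, Q_hi
```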

[448] Advancing Local Clustering on Graphs via Compressive Sensing: Semi-supervised and Unsupervised Methods

Zhaiming Shen, Sung Ha Kang

Main category: cs.LG

TL;DR: The paper proposes semi-supervised and unsupervised local clustering methods using graph sampling, diffusion, and overlap analysis to identify specific substructures in large graphs with few or no labeled data.

DetailsMotivation: Local clustering aims to identify specific substructures in large graphs without additional structural information, but existing methods may require more labeled data than available in practice.

Method: Randomly sample the graph, apply diffusion through local cluster extraction, then examine overlap among results to find clusters. Establishes co-membership conditions for node pairs.

Result: The methods achieve state-of-the-art results in the low-label-rate regime, with rigorous proofs of correctness for the proposed approaches.

Conclusion: The proposed semi-supervised and unsupervised local clustering methods effectively identify specific substructures in large graphs with minimal or no labeled data, demonstrating strong performance in low-label scenarios.

Abstract: Local clustering aims to identify specific substructures within a large graph without any additional structural information of the graph. These substructures are typically small compared to the overall graph, enabling the problem to be approached by finding a sparse solution to a linear system associated with the graph Laplacian. In this work, we first propose a method for identifying specific local clusters when very few labeled data are given, which we term semi-supervised local clustering. We then extend this approach to the unsupervised setting when no prior information on labels is available. The proposed methods involve randomly sampling the graph, applying diffusion through local cluster extraction, then examining the overlap among the results to find each cluster. We establish the co-membership conditions for any pair of nodes, and rigorously prove the correctness of our methods. Additionally, we conduct extensive experiments to demonstrate that the proposed methods achieve state-of-the-art results in the low-label-rate regime.

[449] AnomalyMatch: Discovering Rare Objects of Interest with Semi-supervised and Active Learning

Pablo Gómez, Laslo E. Ruhberg, Maria Teresa Nardone, David O’Ryan

Main category: cs.LG

TL;DR: AnomalyMatch combines semi-supervised learning with active learning for anomaly detection in large datasets, achieving high performance with minimal labeled data.

DetailsMotivation: Anomaly detection in astronomy and computer vision faces challenges due to scarcity of labeled data, making supervised methods infeasible.

Method: Treats anomaly detection as binary classification using FixMatch algorithm with EfficientNet classifiers, integrated with active learning via user interface for verification and correction.

Result: Achieved AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST) with only 5-10 labeled anomalies. After 3 active learning cycles, precision reached 76-94% in top 1% highest-ranking images.

Conclusion: The approach demonstrates exceptional utility and scalability for anomaly discovery in domains with severe label scarcity, performing comparably to established methods like Astronomaly.

Abstract: Anomaly detection in large datasets is essential in astronomy and computer vision. However, due to a scarcity of labelled data, it is often infeasible to apply supervised methods to anomaly detection. We present AnomalyMatch, an anomaly detection framework combining the semi-supervised FixMatch algorithm using EfficientNet classifiers with active learning. AnomalyMatch is tailored for large-scale applications and integrated into the ESA Datalabs science platform. In this method, we treat anomaly detection as a binary classification problem and efficiently utilise limited labelled and abundant unlabelled images for training. We enable active learning via a user interface for verification of high-confidence anomalies and correction of false positives. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance display strong performance. Starting from five to ten labelled anomalies, we achieve an average AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST), with respective AUPRC of 0.82 and 0.77. After three active learning cycles, anomalies are ranked with 76% (miniImageNet) to 94% (GalaxyMNIST) precision in the top 1% of the highest-ranking images by score. We compare to the established Astronomaly software on selected ‘odd’ galaxies from the ‘Galaxy Zoo - The Galaxy Challenge’ dataset, achieving comparable performance with an average AUROC of 0.83. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.
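
Since AnomalyMatch builds on standard FixMatch, the core training step is the familiar two-term objective; the sketch below shows it for a generic PyTorch classifier (batch composition, augmentations, and the threshold value are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def fixmatch_step(model, xl, yl, xu_weak, xu_strong, tau=0.95, lam=1.0):
    """One FixMatch loss evaluation: supervised cross-entropy on the few
    labels, plus a consistency term on unlabeled images whose
    weakly-augmented prediction is confident enough to act as a
    pseudo-label for the strongly-augmented view."""
    sup = F.cross_entropy(model(xl), yl)
    with torch.no_grad():
        probs = torch.softmax(model(xu_weak), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= tau                    # only confident pseudo-labels
    if keep.any():
        unsup = F.cross_entropy(model(xu_strong)[keep], pseudo[keep])
    else:
        unsup = torch.zeros((), device=xl.device)
    return sup + lam * unsup
```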

[450] Buffer layers for Test-Time Adaptation

Hyeongyu Kim, Geonhui Han, Dosik Hwang

Main category: cs.LG

TL;DR: Proposes Buffer layer as a novel alternative to normalization-based Test Time Adaptation, addressing limitations of batch size sensitivity and domain generalization issues.

DetailsMotivation: Existing TTA methods rely on updating normalization layers, which are sensitive to small batch sizes and constrained by pre-trained model statistics that don't generalize well to unseen domains.

Method: Introduces a Buffer layer that preserves pre-trained backbone integrity while enabling adaptation, avoiding catastrophic forgetting and being modular for integration into existing TTA frameworks.

Result: Outperforms traditional methods in domain shift mitigation and model robustness, shows strong resilience to forgetting, and provides consistent performance improvements across various architectures.

Conclusion: The Buffer layer approach is effective and versatile for real-world domain adaptation, offering a fundamental improvement over normalization-based TTA methods.

Abstract: In recent advancements in Test Time Adaptation (TTA), most existing methodologies focus on updating normalization layers to adapt to the test domain. However, the reliance on normalization-based adaptation presents key challenges. First, normalization layers such as Batch Normalization (BN) are highly sensitive to small batch sizes, leading to unstable and inaccurate statistics. Moreover, normalization-based adaptation is inherently constrained by the structure of the pre-trained model, as it relies on training-time statistics that may not generalize well to unseen domains. These issues limit the effectiveness of normalization-based TTA approaches, especially under significant domain shift. In this paper, we introduce a novel paradigm based on the concept of a Buffer layer, which addresses the fundamental limitations of normalization layer updates. Unlike existing methods that modify the core parameters of the model, our approach preserves the integrity of the pre-trained backbone, inherently mitigating the risk of catastrophic forgetting during online adaptation. Through comprehensive experimentation, we demonstrate that our approach not only outperforms traditional methods in mitigating domain shift and enhancing model robustness, but also exhibits strong resilience to forgetting. Furthermore, our Buffer layer is modular and can be seamlessly integrated into nearly all existing TTA frameworks, resulting in consistent performance improvements across various architectures. These findings validate the effectiveness and versatility of the proposed solution in real-world domain adaptation scenarios. The code is available at https://github.com/hyeongyu-kim/Buffer_TTA.

[451] A Robust and Non-Iterative Tensor Decomposition Method with Automatic Thresholding

Hiroki Hasegawa, Yukihiko Okada

Main category: cs.LG

TL;DR: A novel tensor decomposition method that automatically determines tensor rank via statistical thresholding, eliminating the need for prior rank specification and iterative optimization.

DetailsMotivation: Existing tensor decomposition methods require manual rank specification and iterative optimization, leading to computational inefficiency and dependence on analyst expertise.

Method: Statistical singular value hard thresholding on mode-wise unfolded matrices using Marčenko-Pastur distribution to automatically extract statistically significant components.

Result: Outperforms conventional methods (HOSVD, HOOI, Tucker-L2E) in both estimation accuracy and computational efficiency.

Conclusion: Provides an effective, automatic, non-iterative, and analyst-independent framework for tensor decomposition with theoretical guarantees.

Abstract: Recent advances in IoT and biometric sensing technologies have led to the generation of massive and high-dimensional tensor data, yet achieving accurate and efficient low-rank approximation remains a major challenge. Existing tensor decomposition methods typically require prior specification of the tensor rank and rely on iterative optimization, which often results in heavy computational costs and dependence on the analyst’s expertise. In this study, we propose a novel low-rank approximation method for tensor data that requires neither prior rank specification nor iterative optimization. The proposed method performs statistical singular value hard thresholding on the mode-wise unfolded matrices to automatically extract only statistically significant components, thereby achieving noise reduction while preserving the intrinsic tensor structure. Theoretically, the optimal threshold for each mode is derived based on the asymptotic properties of the Marčenko–Pastur distribution. Simulation experiments demonstrate that the proposed method outperforms conventional approaches such as Higher-Order Singular Value Decomposition, Higher-Order Orthogonal Iteration, and Tucker-L2E in terms of both estimation accuracy and computational efficiency. These results indicate that our method provides an effective and theoretically grounded framework for automatic, non-iterative, and analyst-independent tensor decomposition.
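
A hedged sketch of the recipe (the paper derives the optimal per-mode threshold; here we use the classical Marčenko-Pastur bulk edge sigma*(sqrt(n)+sqrt(p)) for an n x p noise matrix with a known noise level sigma):

```python
import numpy as np

def mp_hard_threshold_factors(T, sigma):
    """Automatic rank selection by singular value hard thresholding on each
    mode-wise unfolding: singular values below the Marchenko-Pastur bulk
    edge are treated as noise, so no rank is specified and no iteration
    is needed. Returns per-mode factor matrices."""
    factors = []
    for mode in range(T.ndim):
        M = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)  # unfolding
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        edge = sigma * (np.sqrt(M.shape[0]) + np.sqrt(M.shape[1]))
        rank = max(int((s > edge).sum()), 1)   # keep significant components
        factors.append(U[:, :rank])
    return factors

# Toy usage: a rank-(2,2,2) tensor plus noise recovers the small mode ranks.
rng = np.random.default_rng(0)
core = rng.normal(size=(2, 2, 2))
T = np.einsum('ijk,ai,bj,ck->abc', core,
              *(rng.normal(size=(30, 2)) for _ in range(3)))
T += 0.1 * rng.normal(size=T.shape)
print([F.shape[1] for F in mp_hard_threshold_factors(T, sigma=0.1)])
```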

[452] Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction

Omer Jauhar Khan, Sudais Khan, Hafeez Anwar, Shahzeb Khan, Shams Ul Arifeen

Main category: cs.LG

TL;DR: Physics Informed Neural Networks (PINNs) with a novel PIKAN architecture predict spaghetti bridge weights using physics-based constraints, achieving R²=0.9603 and MAE=10.50 with limited data.

DetailsMotivation: To explore PINNs for predicting small-scale spaghetti bridge weights, which helps understand load limits and failure modes in simplified structural models with limited data.

Method: Proposed framework incorporates physics-based constraints, including standard PINNs and a novel Physics Informed Kolmogorov Arnold Network (PIKAN) that blends universal function approximation with physical insights. Uses structural parameters collected manually or via computer vision.

Result: Best model achieves R² score of 0.9603 and mean absolute error of 10.50 units on dataset of 15 real bridges augmented to 100 samples. Also developed web interface for parameter entry and prediction.

Conclusion: PINNs can provide reliable structural weight estimates even with limited data, potentially aiding early-stage failure analysis in lightweight bridge designs.

Abstract: Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering tasks with limited data. This paper aims to explore the use of PINNs to predict the weight of small-scale spaghetti bridges, a task relevant to understanding load limits and potential failure modes in simplified structural models. Our proposed framework incorporates physics-based constraints into the prediction model for improved performance. In addition to standard PINNs, we introduce a novel architecture named Physics Informed Kolmogorov Arnold Network (PIKAN), which blends universal function approximation theory with physical insights. The structural parameters provided as input to the model are collected either manually or through computer vision methods. Our dataset includes 15 real bridges, augmented to 100 samples, and our best model achieves an $R^2$ score of 0.9603 and a mean absolute error (MAE) of 10.50 units. From an applied perspective, we also provide a web-based interface for parameter entry and prediction. These results show that PINNs can offer reliable estimates of structural weight, even with limited data, and may help inform early-stage failure analysis in lightweight bridge designs. The complete data and code are available at https://github.com/OmerJauhar/PINNS-For-Spaghetti-Bridges.
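
A generic PINN-style setup for this kind of tabular regression might look as follows (the paper's actual physics constraints and input features are not specified here; the monotonicity prior below is a hypothetical stand-in for a physics-based constraint):

```python
import torch
import torch.nn as nn

class BridgePINN(nn.Module):
    """Plain MLP regressor; the physics enters through the loss, not the
    architecture."""
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def pinn_loss(model, x, y, physics_dim=0):
    """Data-fit term plus a penalty on violations of an assumed physical
    prior: predicted load capacity should not decrease as the feature at
    `physics_dim` (e.g., a member cross-section) grows."""
    x = x.clone().requires_grad_(True)
    pred = model(x)
    data = ((pred - y) ** 2).mean()
    grads = torch.autograd.grad(pred.sum(), x, create_graph=True)[0]
    physics = torch.relu(-grads[:, physics_dim]).mean()
    return data + 0.1 * physics

model = BridgePINN(n_features=6)
x, y = torch.randn(32, 6), torch.randn(32)
loss = pinn_loss(model, x, y)
loss.backward()
```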

[453] Improving the Euclidean Diffusion Generation of Manifold Data by Mitigating Score Function Singularity

Zichen Liu, Wei Zhang, Tiejun Li

Main category: cs.LG

TL;DR: The paper addresses the multiscale singularity problem in Euclidean diffusion models when applied to manifold-structured data, proposing two methods (Niso-DM and Tango-DM) to improve sampling accuracy by handling score function singularities.

DetailsMotivation: Euclidean diffusion models have limitations when applied to manifold-structured data due to score function singularities in the ambient space, which degrade sampling accuracy despite recent extensions to manifolds.

Method: The authors analyze the singularity structure of score functions by decomposing them into tangential and normal components, then propose two methods: Niso-DM (using non-isotropic noise to reduce scale discrepancies) and Tango-DM (training only the tangential component with a tangential-only loss).

Result: Numerical experiments show that both proposed methods achieve superior performance on distributions over various manifolds with complex geometries compared to standard approaches.

Conclusion: The proposed methods effectively mitigate the singularity issues in score functions for manifold-structured data, enabling more accurate sampling in diffusion models without requiring explicit manifold structure utilization.

Abstract: Euclidean diffusion models have achieved remarkable success in generative modeling across diverse domains, and they have been extended to manifold cases in recent advances. Instead of explicitly utilizing the structure of special manifolds as studied in previous works, in this paper we investigate direct sampling of the Euclidean diffusion models for general manifold-structured data. We reveal the multiscale singularity of the score function in the ambient space, which hinders the accuracy of diffusion-generated samples. We then present an elaborate theoretical analysis of the singularity structure of the score function by decomposing it along the tangential and normal directions of the manifold. To mitigate the singularity and improve the sampling accuracy, we propose two novel methods: (1) Niso-DM, which reduces the scale discrepancies in the score function by utilizing a non-isotropic noise, and (2) Tango-DM, which trains only the tangential component of the score function using a tangential-only loss function. Numerical experiments demonstrate that our methods achieve superior performance on distributions over various manifolds with complex geometries.

[454] Is Grokking a Computational Glass Relaxation?

Xiaotian Zhang, Yue Shang, Entao Yang, Ge Zhang

Main category: cs.LG

TL;DR: Grokking is interpreted as computational glass relaxation - memorization is like rapid cooling into glassy state, while generalization is slow relaxation to stable configuration. Experiments show no entropy barrier in grokking transition, challenging first-order phase transition theories. A new optimizer eliminates grokking without constraints.

DetailsMotivation: To understand neural network generalizability through the phenomenon of grokking, where networks abruptly generalize long after perfect training performance, offering insights into underlying mechanisms.

Method: Framed grokking as computational glass relaxation: NNs as physical systems with parameters as degrees of freedom and train loss as energy. Sampled the Boltzmann entropy landscape as a function of training loss and test accuracy using transformers on arithmetic tasks. Developed the WanD optimizer based on Wang-Landau molecular dynamics.

Result: Found NO entropy barrier in memorization-to-generalization transition of grokking, challenging first-order phase transition theories. Identified high-entropy advantage under grokking. WanD optimizer eliminated grokking without constraints and found high-norm generalizing solutions.

Conclusion: Grokking is not a first-order phase transition with entropy barrier. Provides counterexamples to theories attributing grokking solely to weight norm evolution. Suggests new potential ways for optimizer design based on far-from-equilibrium dynamics.

Abstract: Understanding neural network (NN) generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs’ generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find the memorization process resembles a rapid cooling of liquid into a non-equilibrium glassy state at low temperature and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs’ Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments in transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking’s far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.

[455] Neurosymbolic Diffusion Models

Emile van Krieken, Pasquale Minervini, Edoardo Ponti, Antonio Vergari

Main category: cs.LG

TL;DR: NeSyDMs introduce neurosymbolic diffusion models that use discrete diffusion to model dependencies between symbols, overcoming the independence assumption in standard neurosymbolic predictors and improving uncertainty quantification and out-of-distribution generalization.

DetailsMotivation: Standard neurosymbolic predictors assume conditional independence between symbols, limiting their ability to model interactions and uncertainty, leading to overconfident predictions and poor out-of-distribution generalization.

Method: NeSyDMs use discrete diffusion to model dependencies between symbols while reusing the independence assumption from standard NeSy predictors at each diffusion step, enabling scalable learning.

Result: NeSyDMs achieve state-of-the-art accuracy among neurosymbolic predictors and demonstrate strong calibration across synthetic and real-world benchmarks including visual path planning and rule-based autonomous driving.

Conclusion: NeSyDMs effectively overcome the limitations of independence assumptions in neurosymbolic predictors by modeling symbol dependencies through discrete diffusion, leading to improved accuracy and uncertainty quantification.

Abstract: Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty - often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and uncertainty quantification. Across both synthetic and real-world benchmarks - including high-dimensional visual path planning and rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.

[456] On the creation of narrow AI: hierarchy and nonlocality of neural network skills

Eric J. Michaud, Asher Parker-Sartori, Max Tegmark

Main category: cs.LG

TL;DR: Training narrow AI models faces challenges: some skills require broad training data for hierarchical learning, and skill transfer from large models to small ones works better with pruning than distillation.

DetailsMotivation: To create efficient and safe narrow AI systems by addressing challenges in training specialized models from scratch and transferring skills from large foundation models.

Method: Two approaches: 1) Experiments on synthetic tasks to test when narrow training is possible vs requiring broad data distribution, 2) Skill transfer methods comparing pruning vs distillation, including regularization to align skills with prunable components.

Result: Found that hierarchical skills require broad training data for effective learning due to curriculum effects, and that pruning-based skill transfer outperforms distillation despite imperfect skill localization.

Conclusion: Creating narrow AI systems requires careful consideration of training data breadth for hierarchical skills and pruning-based approaches for efficient skill transfer from large models.

Abstract: We study the problem of creating strong, yet narrow, AI systems. While recent AI progress has been driven by the training of large general-purpose foundation models, the creation of smaller models specialized for narrow domains could be valuable for both efficiency and safety. In this work, we explore two challenges involved in creating such systems, having to do with basic properties of how neural networks learn and structure their representations. The first challenge regards when it is possible to train narrow models from scratch. Through experiments on a synthetic task, we find that it is sometimes necessary to train networks on a wide distribution of data to learn certain narrow skills within that distribution. This effect arises when skills depend on each other hierarchically, and training on a broad distribution introduces a curriculum which substantially accelerates learning. The second challenge regards how to transfer particular skills from large general models into small specialized models. We find that model skills are often not perfectly localized to a particular set of prunable components. However, we find that methods based on pruning can still outperform distillation. We investigate the use of a regularization objective to align desired skills with prunable components while unlearning unnecessary skills.

[457] C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models

Amir Hossein Rahmati, Sanket Jantre, Weifeng Zhang, Yucheng Wang, Byung-Jun Yoon, Nathan M. Urban, Xiaoning Qian

Main category: cs.LG

TL;DR: C-LoRA is a novel uncertainty-aware fine-tuning method that addresses LoRA’s overconfidence in few-shot settings by developing input-contextualized lightweight modules to dynamically adapt uncertainty estimates.

DetailsMotivation: Standard LoRA produces overconfident predictions in data-scarce few-shot settings and existing uncertainty-aware approaches neglect how input characteristics affect predictive uncertainty estimates.

Method: Proposes Contextual Low-Rank Adaptation (C-LoRA) with new lightweight LoRA modules contextualized to each input data sample, incorporating data-driven contexts into parameter posteriors to dynamically adapt uncertainty estimates.

Result: Extensive experiments on LLaMA2-7B models show C-LoRA consistently outperforms state-of-the-art uncertainty-aware LoRA methods in both uncertainty quantification and model generalization. Ablation studies confirm the critical role of contextual modules in capturing sample-specific uncertainties.

Conclusion: C-LoRA sets a new standard for robust, uncertainty-aware LLM fine-tuning in few-shot regimes, is architecture-agnostic, and in principle applies beyond 7B models, though scaling to larger models remains an open problem.

Abstract: Low-Rank Adaptation (LoRA) offers a cost-effective solution for fine-tuning large language models (LLMs), but it often produces overconfident predictions in data-scarce few-shot settings. To address this issue, several classical statistical learning approaches have been repurposed for scalable uncertainty-aware LoRA fine-tuning. However, these approaches neglect how input characteristics affect the predictive uncertainty estimates. To address this limitation, we propose Contextual Low-Rank Adaptation (C-LoRA) as a novel uncertainty-aware and parameter efficient fine-tuning approach, by developing new lightweight LoRA modules contextualized to each input data sample to dynamically adapt uncertainty estimates. Incorporating data-driven contexts into the parameter posteriors, C-LoRA mitigates overfitting, achieves well-calibrated uncertainties, and yields robust predictions. Extensive experiments on LLaMA2-7B models demonstrate that C-LoRA consistently outperforms the state-of-the-art uncertainty-aware LoRA methods in both uncertainty quantification and model generalization. Ablation studies further confirm the critical role of our contextual modules in capturing sample-specific uncertainties. C-LoRA sets a new standard for robust, uncertainty-aware LLM fine-tuning in few-shot regimes. Although our experiments are limited to 7B models, our method is architecture-agnostic and, in principle, applies beyond this scale; studying its scaling to larger models remains an open problem. Our code is available at https://github.com/ahra99/c_lora.
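
A minimal sketch of what an input-contextualized LoRA layer could look like, assuming a sigmoid gate over the rank dimension produced by a small context network; the gating form and module names are illustrative assumptions, not the paper's exact C-LoRA design.

```python
# Minimal sketch: a LoRA update modulated by a per-sample context (illustrative).
import torch
import torch.nn as nn

class ContextualLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8, d_ctx=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)            # stands in for the frozen weight
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        # Lightweight context network: maps each input to per-rank scales, so
        # the effective low-rank update depends on the sample.
        self.ctx = nn.Sequential(nn.Linear(d_in, d_ctx), nn.Tanh(),
                                 nn.Linear(d_ctx, rank))

    def forward(self, x):
        gate = torch.sigmoid(self.ctx(x))             # (batch, rank), in (0, 1)
        low_rank = (x @ self.A.T) * gate              # contextualized projection
        return self.base(x) + low_rank @ self.B.T

layer = ContextualLoRALinear(64, 64)
print(layer(torch.randn(4, 64)).shape)                # torch.Size([4, 64])
```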

[458] Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL

Claude Formanek, Omayma Mahjoub, Louay Ben Nessir, Sasha Abramowitz, Ruan de Kock, Wiem Khlifi, Daniel Rajaonarivonivelomanantsoa, Simon Du Toit, Arnol Fokam, Siddarth Singh, Ulrich Mbou Sob, Felix Chalumeau, Arnu Pretorius

Main category: cs.LG

TL;DR: Oryx is a novel offline multi-agent reinforcement learning algorithm that combines retention-based architecture with sequential implicit constraint Q-learning to achieve effective many-agent coordination while maintaining temporal coherence.

DetailsMotivation: Addressing the key challenge of achieving effective many-agent multi-step coordination in complex environments for offline multi-agent reinforcement learning.

Method: Adapts retention-based architecture Sable and combines it with sequential implicit constraint Q-learning to develop an offline autoregressive policy update scheme.

Result: Achieves state-of-the-art performance on more than 80% of 65 tested datasets across SMAC, RWARE, and Multi-Agent MuJoCo benchmarks, outperforming prior methods and demonstrating robust generalization.

Conclusion: Oryx effectively solves complex coordination challenges and scales well in many-agent settings, introducing superior capabilities for offline MARL.

Abstract: A key challenge in offline multi-agent reinforcement learning (MARL) is achieving effective many-agent multi-step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention-based architecture Sable and combines it with a sequential form of implicit constraint Q-learning (ICQ), to develop a novel offline autoregressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over long trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works – SMAC, RWARE, and Multi-Agent MuJoCo – covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many-agent coordination in offline MARL, and demonstrate Oryx’s superior ability to scale effectively in such settings.

[459] Rethinking Neural Combinatorial Optimization for Vehicle Routing Problems with Different Constraint Tightness Degrees

Fu Luo, Yaoxin Wu, Zhi Zheng, Zhenkun Wang

Main category: cs.LG

TL;DR: The paper analyzes neural combinatorial optimization methods’ performance under different constraint tightness levels in vehicle routing problems, identifies overfitting issues, and proposes a multi-expert training scheme to improve adaptability.

DetailsMotivation: Existing neural combinatorial optimization methods are trained and tested with fixed constraint values, lacking research on how constraint tightness affects performance, leading to overfitting and poor generalization.

Method: Developed an efficient training scheme that explicitly considers varying constraint tightness degrees and proposed a multi-expert module to learn generally adaptable solving strategies.

Result: The proposed method effectively overcomes overfitting issues and demonstrates superior performance on CVRP and CVRPTW with various constraint tightness degrees compared to existing methods.

Conclusion: The multi-expert training scheme successfully addresses the constraint overfitting problem in neural combinatorial optimization, enabling better performance across different constraint tightness levels.

Abstract: Recent neural combinatorial optimization (NCO) methods have shown promising problem-solving ability without requiring domain-specific expertise. Most existing NCO methods use training and testing data with a fixed constraint value and lack research on the effect of constraint tightness on the performance of NCO methods. This paper takes the capacity-constrained vehicle routing problem (CVRP) as an example to empirically analyze the NCO performance under different tightness degrees of the capacity constraint. Our analysis reveals that existing NCO methods overfit the capacity constraint, and they can only perform satisfactorily on a small range of the constraint values but poorly on other values. To tackle this drawback of existing NCO methods, we develop an efficient training scheme that explicitly considers varying degrees of constraint tightness and proposes a multi-expert module to learn a generally adaptable solving strategy. Experimental results show that the proposed method can effectively overcome the overfitting issue, demonstrating superior performances on the CVRP and CVRP with time windows (CVRPTW) with various constraint tightness degrees.
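
A minimal sketch of a multi-expert head gated by a constraint-tightness scalar (e.g., a demand-to-capacity ratio); the gating network and shapes are illustrative assumptions rather than the paper's architecture.

```python
# Minimal sketch: mixture of experts weighted by constraint tightness.
import torch
import torch.nn as nn

class MultiExpertHead(nn.Module):
    def __init__(self, d_model, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model)
                                      for _ in range(n_experts)])
        self.gate = nn.Linear(1, n_experts)  # tightness scalar -> expert weights

    def forward(self, h, tightness):
        # h: (batch, d_model); tightness: (batch, 1), e.g. demand/capacity ratio
        w = torch.softmax(self.gate(tightness), dim=-1)           # (batch, n_experts)
        outs = torch.stack([e(h) for e in self.experts], dim=1)   # (batch, n_experts, d)
        return (w.unsqueeze(-1) * outs).sum(dim=1)

head = MultiExpertHead(128)
print(head(torch.randn(8, 128), torch.rand(8, 1)).shape)  # torch.Size([8, 128])
```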

[460] Improving Generalization of Neural Combinatorial Optimization for Vehicle Routing Problems via Test-Time Projection Learning

Yuanyao Chen, Rongsheng Chen, Fu Luo, Zhenkun Wang

Main category: cs.LG

TL;DR: A novel LLM-driven framework that enables NCO models trained on small-scale VRPs to generalize effectively to large-scale problems (up to 100K nodes) by learning distribution projections during inference without retraining.

DetailsMotivation: Existing Neural Combinatorial Optimization methods trained on small instances (e.g., 100 nodes) suffer significant performance degradation on large-scale VRPs due to distributional shift between training and testing data.

Method: Introduces a Large Language Model-driven framework that learns a projection between training and testing distributions, deployed exclusively during inference phase to enhance NCO model scalability without requiring retraining.

Result: Enables backbone models trained on 100-node instances to achieve superior performance on large-scale TSP and CVRP problems up to 100K nodes from diverse distributions.

Conclusion: The proposed LLM-driven framework effectively bridges the distribution gap between small-scale training and large-scale testing, providing a practical solution for scaling NCO methods without the computational cost of retraining.

Abstract: Neural Combinatorial Optimization (NCO) has emerged as a promising learning-based paradigm for addressing Vehicle Routing Problems (VRPs) by minimizing the need for extensive manual engineering. While existing NCO methods, trained on small-scale instances (e.g., 100 nodes), have demonstrated considerable success on problems of similar scale, their performance significantly degrades when applied to large-scale scenarios. This degradation arises from the distributional shift between training and testing data, rendering policies learned on small instances ineffective for larger problems. To overcome this limitation, we introduce a novel learning framework driven by Large Language Models (LLMs). This framework learns a projection between the training and testing distributions, which is then deployed to enhance the scalability of the NCO model. Notably, unlike prevailing techniques that necessitate joint training with the neural network, our approach operates exclusively during the inference phase, obviating the need for model retraining. Extensive experiments demonstrate that our method enables a backbone model (trained on 100-node instances) to achieve superior performance on large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) of up to 100K nodes from diverse distributions.

[461] MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver

Yuepeng Zheng, Fu Luo, Zhenkun Wang, Yaoxin Wu, Yu Zhou

Main category: cs.LG

TL;DR: A multi-task learning method using knowledge distillation (MTL-KD) to train heavy decoder models for solving multiple Vehicle Routing Problem variants with strong generalization to large-scale problems.

DetailsMotivation: Existing RL-based multi-task methods can only train light decoder models on small-scale problems with limited generalization ability for large-scale VRPs.

Method: Proposes MTL-KD that transfers policy knowledge from multiple single-task RL models to a single heavy decoder model, plus Random Reordering Re-Construction (R3C) inference strategy for diverse VRP tasks.

Result: Achieves superior performance on 6 seen and 10 unseen VRP variants with up to 1000 nodes, showing robust generalization on both uniform and real-world benchmarks.

Conclusion: The MTL-KD method effectively overcomes limitations of existing multi-task approaches and demonstrates strong generalization capabilities across diverse VRP tasks.

Abstract: Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach to train a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables the efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model’s generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.
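
A minimal sketch of the label-free distillation step, assuming frozen per-task RL teachers that emit action logits (e.g., next-node scores) which the heavy-decoder student matches via a temperature-scaled KL term; this is a generic knowledge-distillation recipe, not necessarily the paper's exact objective.

```python
# Minimal sketch: distilling one task's frozen teacher into a shared student.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened action distributions."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

# One training step: sample a VRP task, query its teacher, distill into the student.
student_logits = torch.randn(32, 100, requires_grad=True)  # e.g. next-node scores
teacher_logits = torch.randn(32, 100)                      # frozen teacher output
loss = kd_loss(student_logits, teacher_logits)
loss.backward()
```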

[462] AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation

Wenyu Zhu, Jianhui Wang, Bowen Gao, Yinjun Jia, Haichuan Tan, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan

Main category: cs.LG

TL;DR: AANet is a virtual screening framework that improves drug discovery by handling apo structures and structural uncertainty through alignment-and-aggregation, achieving significant performance gains over existing methods.

DetailsMotivation: Most virtual screening methods rely on holo protein structures with known ligand-bound pockets, but real-world drug discovery often involves apo structures where pocket information is missing, leading to performance degradation.

Method: Uses a tri-modal contrastive learning module to align ligand, holo pocket, and cavity representations, plus a cross-attention adapter to dynamically aggregate candidate binding sites without precise pocket annotations.

Result: Achieved dramatic improvement in blind apo setting, increasing early enrichment factor (EF1%) from 11.75 to 37.19, while maintaining strong performance on holo structures.

Conclusion: The approach shows promise for advancing first-in-class drug discovery, especially in scenarios lacking experimentally resolved protein-ligand complexes.

Abstract: Virtual screening (VS) is a critical component of modern drug discovery, yet most existing methods–whether physics-based or deep learning-based–are developed around holo protein structures with known ligand-bound pockets. Consequently, their performance degrades significantly on apo or predicted structures such as those from AlphaFold2, which are more representative of real-world early-stage drug discovery, where pocket information is often missing. In this paper, we introduce an alignment-and-aggregation framework to enable accurate virtual screening under structural uncertainty. Our method comprises two core components: (1) a tri-modal contrastive learning module that aligns representations of the ligand, the holo pocket, and cavities detected from structures, thereby enhancing robustness to pocket localization error; and (2) a cross-attention based adapter for dynamically aggregating candidate binding sites, enabling the model to learn from activity data even without precise pocket annotations. We evaluated our method on a newly curated benchmark of apo structures, where it significantly outperforms state-of-the-art methods in blind apo setting, improving the early enrichment factor (EF1%) from 11.75 to 37.19. Notably, it also maintains strong performance on holo structures. These results demonstrate the promise of our approach in advancing first-in-class drug discovery, particularly in scenarios lacking experimentally resolved protein-ligand complexes. Our implementation is publicly available at https://github.com/Wiley-Z/AANet.
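
A minimal sketch of a tri-modal contrastive objective over ligand, holo-pocket, and cavity embeddings, assuming pairwise InfoNCE terms between the three views; the pairwise decomposition is an assumption about the design, not the paper's published loss.

```python
# Minimal sketch: pairwise InfoNCE over three aligned modalities.
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau                    # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))          # matched pairs on the diagonal
    return F.cross_entropy(logits, labels)

ligand, pocket, cavity = (torch.randn(16, 256) for _ in range(3))
loss = info_nce(ligand, pocket) + info_nce(ligand, cavity) + info_nce(pocket, cavity)
```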

[463] Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks

Shakir Yousefi, Andreas Plesner, Till Aczel, Roger Wattenhofer

Main category: cs.LG

TL;DR: The paper proposes using Gumbel noise with straight-through estimator to accelerate logic gate networks training, reduce discretization gap, and improve neuron utilization.

DetailsMotivation: Logic gate networks (LGNs) are efficient for image classification but suffer from slow training (days to weeks), high unused network components, and significant discretization gap that hinders real-world deployment.

Method: Inject Gumbel noise with straight-through estimator during training, which provides implicit Hessian regularization to improve convergence properties of LGNs.

Result: 4.5x faster training in wall-clock time, 98% reduction in discretization gap, and 100% reduction in unused gates.

Conclusion: The proposed method enables practical deployment of LGNs by significantly improving training efficiency and reducing performance drop between training and inference.

Abstract: Modern neural networks demonstrate state-of-the-art performance on numerous existing benchmarks; however, their high computational requirements and energy consumption prompt researchers to seek more efficient solutions for real-world deployment. Logic gate networks (LGNs) learn a large network of logic gates for efficient image classification. However, learning a network that can solve a simple problem like CIFAR-10 can take days to weeks to train. Even then, almost half of the network remains unused, causing a discretization gap. This discretization gap hinders real-world deployment of LGNs, as the performance drop between training and inference negatively impacts accuracy. We inject Gumbel noise with a straight-through estimator during training to significantly speed up training, improve neuron utilization, and decrease the discretization gap. We theoretically show that this results from implicit Hessian regularization, which improves the convergence properties of LGNs. We train networks $4.5 \times$ faster in wall-clock time, reduce the discretization gap by $98\%$, and reduce the number of unused gates by $100\%$.
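
A minimal sketch of the core trick: Gumbel noise plus a straight-through estimator for picking one of k candidate gates per neuron, with a hard forward pass and a soft backward pass. The gate count and temperature are illustrative choices.

```python
# Minimal sketch: straight-through Gumbel-softmax gate selection.
import torch

def gumbel_st_select(logits, tau=1.0):
    """Forward: hard one-hot gate choice. Backward: soft Gumbel-softmax gradient."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
    soft = torch.softmax((logits + gumbel) / tau, dim=-1)
    hard = torch.zeros_like(soft).scatter_(-1, soft.argmax(-1, keepdim=True), 1.0)
    return hard + (soft - soft.detach())  # straight-through trick

logits = torch.randn(8, 16, requires_grad=True)  # 8 neurons, 16 gate types
choice = gumbel_st_select(logits)
choice.sum().backward()                           # gradients flow via the soft term
```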

[464] When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

Youqi Wu, Jingwei Zhang, Farzan Farnia

Main category: cs.LG

TL;DR: KrossFuse is a framework that fuses complementary embeddings through kernel multiplication, enabling integration of cross-modal and unimodal models while preserving cross-modal alignment.

DetailsMotivation: Different embedding models capture complementary features (e.g., texture vs structure), and there's a performance gap between cross-modal embeddings and unimodal experts that needs bridging.

Method: Uses kernel multiplication via Kronecker product of embedding feature maps, with RP-KrossFuse variant using random projections for scalable approximation.

Result: Effectively integrates cross-modal and unimodal models, enhancing modality-specific performance while preserving cross-modal alignment.

Conclusion: Kernel multiplication provides a principled approach for embedding fusion, successfully bridging the gap between cross-modal and unimodal models.

Abstract: State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
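
A minimal NumPy sketch of the fusion: the per-sample Kronecker product realizes the product kernel exactly, and random projections give a compact approximation in the spirit of RP-KrossFuse. The projection construction here is a generic random-features variant (approximating the product kernel up to scale), not necessarily the paper's.

```python
# Minimal sketch: Kronecker-product fusion and a random-projection approximation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))  # embeddings from model 1, one row per sample
y = rng.normal(size=(32, 48))  # embeddings from model 2

# Exact fusion: per-sample Kronecker product realizes kernel k1(x,x') * k2(y,y').
fused = np.einsum("ni,nj->nij", x, y).reshape(32, -1)  # (32, 64*48)

# RP-style approximation: project each embedding randomly, then multiply
# elementwise, keeping the fused dimension manageable.
d = 256
Px = rng.normal(size=(64, d)) / np.sqrt(d)
Py = rng.normal(size=(48, d)) / np.sqrt(d)
fused_rp = (x @ Px) * (y @ Py)                          # (32, 256)
```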

[465] A geometric framework for momentum-based optimizers for low-rank training

Steffen Schotthöfer, Timon Klein, Jonas Kusch

Main category: cs.LG

TL;DR: Training low-rank neural networks with classical momentum methods faces convergence issues due to geometric constraints. The paper introduces new optimizers combining dynamical low-rank approximation with momentum to respect parameter space geometry, achieving faster convergence and better performance.

DetailsMotivation: Low-rank pre-training and fine-tuning reduce computational costs but classical optimizers like momentum methods struggle with convergence due to the geometric structure of low-rank parameterizations.

Method: Developed novel training strategies using dynamical low-rank approximation combined with momentum-based optimization that explicitly accounts for the geometric structure of the parameter space.

Result: Numerical experiments show faster convergence and stronger validation metrics at given parameter budgets compared to conventional optimizers.

Conclusion: The proposed geometric-aware optimizers effectively address convergence issues in low-rank training and outperform traditional momentum methods.

Abstract: Low-rank pre-training and fine-tuning have recently emerged as promising techniques for reducing the computational and storage costs of large neural networks. Training low-rank parameterizations typically relies on conventional optimizers such as heavy ball momentum methods or Adam. In this work, we identify and analyze potential difficulties that these training methods encounter when used to train low-rank parameterizations of weights. In particular, we show that classical momentum methods can struggle to converge to a local optimum due to the geometry of the underlying optimization landscape. To address this, we introduce novel training strategies derived from dynamical low-rank approximation, which explicitly account for the underlying geometric structure. Our approach leverages and combines tools from dynamical low-rank approximation and momentum-based optimization to design optimizers that respect the intrinsic geometry of the parameter space. We validate our methods through numerical experiments, demonstrating faster convergence, and stronger validation metrics at given parameter budgets.

Antonis Vasileiou, Timo Stoll, Christopher Morris

Main category: cs.LG

TL;DR: A unified framework for analyzing generalization properties of message-passing graph neural networks (MPNNs) in node and link prediction tasks, addressing limitations of existing approaches and incorporating graph structure influence.

DetailsMotivation: MPNNs are widely used for node and link prediction but their generalization capabilities remain poorly understood, especially for node- and link-level predictions where existing works make unrealistic i.i.d. assumptions and overlook graph structure influence.

Method: Developed a unified theoretical framework to analyze MPNN generalization in inductive and transductive settings, incorporating diverse architectural parameters, loss functions, and quantifying graph structure influence.

Result: The proposed framework provides theoretical insights into MPNN generalization and is supported by empirical studies, showing applicability beyond graphs to any classification task under inductive or transductive settings.

Conclusion: The work deepens understanding of MPNNs’ generalization capabilities in node and link prediction tasks through a comprehensive framework that addresses limitations of previous approaches and accounts for graph structure influence.

Abstract: Using message-passing graph neural networks (MPNNs) for node and link prediction is crucial in various scientific and industrial domains, which has led to the development of diverse MPNN architectures. Although they work well in practical settings, their ability to generalize beyond the training set remains poorly understood. While some studies have explored MPNNs' generalization in graph-level prediction tasks, much less attention has been given to node- and link-level predictions. Existing works often rely on unrealistic i.i.d. assumptions, overlooking possible correlations between nodes or links, and assuming fixed aggregation and impractical loss functions while neglecting the influence of graph structure. In this work, we introduce a unified framework to analyze the generalization properties of MPNNs in inductive and transductive node and link prediction settings, incorporating diverse architectural parameters and loss functions and quantifying the influence of graph structure. Additionally, our proposed generalization framework can be applied beyond graphs to any classification task under the inductive or transductive setting. Our empirical study supports our theoretical insights, deepening our understanding of MPNNs' generalization capabilities in these tasks.

[467] LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding

Yuchen Ma, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel

Main category: cs.LG

TL;DR: The paper shows that the discrepancy between data available at training time and at inference time biases treatment effect estimates, and proposes an LLM-based framework with a doubly robust learner to mitigate this inference time text confounding.

DetailsMotivation: Treatment effect models face bias when trained on complete medical data but used with incomplete text descriptions at inference, creating text confounding issues.

Method: Proposes framework using large language models with custom doubly robust learner to handle partial confounder information in text at inference time.

Result: Experiments demonstrate effectiveness in mitigating bias from inference time text confounding in real-world applications.

Conclusion: The framework successfully addresses treatment effect estimation bias caused by training-inference data discrepancy through LLM integration and doubly robust learning.

Abstract: Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions. (1) We show that the discrepancy between the data available during training time and inference time can lead to biased estimates of treatment effects. We formalize this issue as an inference time text confounding problem, where confounders are fully observed during training time but only partially available through text at inference time. (2) To address this problem, we propose a novel framework for estimating treatment effects that explicitly accounts for inference time text confounding. Our framework leverages large language models together with a custom doubly robust learner to mitigate biases caused by the inference time text confounding. (3) Through a series of experiments, we demonstrate the effectiveness of our framework in real-world applications.
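
As a reference point for the "custom doubly robust learner", here is a generic doubly robust (AIPW) estimator sketch; the paper's learner is tailored to text confounding at inference time, so treat this only as the textbook building block it extends.

```python
# Minimal sketch: augmented inverse-propensity-weighted (AIPW) scores.
import numpy as np

def aipw_scores(y, t, e_hat, mu0_hat, mu1_hat):
    """Per-sample doubly robust scores whose mean estimates E[Y(1) - Y(0)]."""
    return (mu1_hat - mu0_hat
            + t * (y - mu1_hat) / e_hat
            - (1 - t) * (y - mu0_hat) / (1 - e_hat))

rng = np.random.default_rng(0)
n = 1000
t = rng.integers(0, 2, n)                          # treatment assignment
y = 2.0 * t + rng.normal(size=n)                   # outcome with true effect 2.0
e_hat = np.full(n, 0.5)                            # propensity estimates
mu0_hat, mu1_hat = np.zeros(n), np.full(n, 2.0)    # outcome model estimates
print(aipw_scores(y, t, e_hat, mu0_hat, mu1_hat).mean())  # ~2.0
```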

[468] Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update

Yu-Jie Zhang, Sheng-An Xu, Peng Zhao, Masashi Sugiyama

Main category: cs.LG

TL;DR: Proposes a jointly efficient algorithm for generalized linear bandits that achieves near-optimal regret with O(1) time and space per round, using a tight confidence set for online mirror descent.

DetailsMotivation: Generalized linear bandits extend classical linear models with non-linear link functions but face computational-statistical tradeoffs; existing methods either have high computational costs or suboptimal regret.

Method: Uses online mirror descent estimator with a tight confidence set derived through mix loss analysis from online prediction, enabling one-pass updates with statistical efficiency comparable to maximum likelihood estimation.

Result: Achieves nearly optimal regret bound with constant time and space complexities per round, providing joint computational and statistical efficiency.

Conclusion: The proposed method successfully resolves the computational-statistical tradeoff in generalized linear bandits through a novel analysis of online mirror descent.

Abstract: We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function, thereby modeling a broad class of reward distributions such as Bernoulli and Poisson. While GLBs are widely applicable to real-world scenarios, their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. Existing methods typically trade off between two objectives, either incurring high per-round costs for optimal regret guarantees or compromising statistical efficiency to enable constant-time updates. In this paper, we propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round. The core of our method is a tight confidence set for the online mirror descent (OMD) estimator, which is derived through a novel analysis that leverages the notion of mix loss from online prediction. The analysis shows that our OMD estimator, even with its one-pass updates, achieves statistical efficiency comparable to maximum likelihood estimation, thereby leading to a jointly efficient optimistic method.
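
A minimal simulation sketch of a one-pass optimistic update for a logistic bandit in this spirit. Note the explicit matrix inverse below is O(d^3) per round for readability, whereas the paper's OMD method is engineered for constant per-round cost; the exploration constant and curvature bookkeeping are illustrative choices, not the paper's algorithm.

```python
# Minimal sketch: one-pass optimistic updates for a logistic bandit (illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, eta = 5, 0.5
theta_star = np.ones(d)   # unknown true parameter (used only to simulate rewards)
theta = np.zeros(d)       # running estimate, updated once per round
H = np.eye(d)             # running curvature / confidence matrix

for t in range(2000):
    arms = rng.normal(size=(10, d))
    Hinv = np.linalg.inv(H)
    bonus = np.sqrt(np.einsum("ad,de,ae->a", arms, Hinv, arms))
    x = arms[np.argmax(arms @ theta + 0.5 * bonus)]   # optimistic arm choice
    r = rng.binomial(1, sigmoid(x @ theta_star))      # Bernoulli reward
    p = sigmoid(x @ theta)
    H += p * (1 - p) * np.outer(x, x)                 # accumulate local curvature
    theta -= eta * np.linalg.solve(H, (p - r) * x)    # single preconditioned step
```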

[469] GFlowNets for Learning Better Drug-Drug Interaction Representations

Azmine Toushik Wasi

Main category: cs.LG

TL;DR: A framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to address severe class imbalance in drug-drug interaction prediction by generating synthetic samples for rare interaction types.

DetailsMotivation: Drug-drug interactions face severe class imbalance where common interactions dominate datasets while rare but critical interactions are underrepresented, leading to poor model performance on infrequent cases. Existing methods treat DDI prediction as binary problems, ignoring class-specific nuances and exacerbating bias toward frequent interactions.

Method: Proposed framework combines Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generating effective and novel DDI pairs.

Result: The approach enhances predictive performance across interaction types, ensuring better clinical reliability.

Conclusion: The GFlowNet-VGAE framework effectively addresses class imbalance in DDI prediction by generating synthetic rare interaction samples, leading to improved model performance and clinical reliability.

Abstract: Drug-drug interactions pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generating effective and novel DDI pairs. Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.

[470] Distributed optimization: designed for federated learning

Wenyou Guo, Ting Qu, Chunrong Pan, George Q. Huang

Main category: cs.LG

TL;DR: Proposes distributed optimization algorithms using augmented Lagrangian technique for federated learning, supporting various communication topologies with theoretical convergence guarantees and strong performance in heterogeneous settings.

DetailsMotivation: Address the need for efficient distributed optimization methods in federated learning that can handle diverse communication topologies while preserving privacy in cross-organizational data collaboration.

Method: Develops augmented Lagrangian-based distributed optimization algorithms with multiple termination criteria and parameter update mechanisms, incorporating proximal relaxation and quadratic approximation to generalize classical optimization methods.

Result: Numerical experiments show strong performance in large-scale settings with significant statistical heterogeneity across clients, with rigorous theoretical convergence guarantees.

Conclusion: The proposed framework provides a unified approach that systematically recovers classical optimization methods while offering enhanced computational efficiency and theoretical guarantees for federated learning applications.

Abstract: Federated learning (FL), as a distributed collaborative machine learning (ML) framework under privacy-preserving constraints, has garnered increasing research attention in cross-organizational data collaboration scenarios. This paper proposes a class of distributed optimization algorithms based on the augmented Lagrangian technique, designed to accommodate diverse communication topologies in both centralized and decentralized FL settings. Furthermore, we develop multiple termination criteria and parameter update mechanisms to enhance computational efficiency, accompanied by rigorous theoretical guarantees of convergence. By generalizing the augmented Lagrangian relaxation through the incorporation of proximal relaxation and quadratic approximation, our framework systematically recovers a broad class of classical unconstrained optimization methods, including the proximal algorithm, classic gradient descent, and stochastic gradient descent, among others. Notably, the convergence properties of these methods can be naturally derived within the proposed theoretical framework. Numerical experiments demonstrate that the proposed algorithm exhibits strong performance in large-scale settings with significant statistical heterogeneity across clients.
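
For intuition about the augmented Lagrangian machinery, here is a minimal consensus-ADMM-style sketch for federated least squares; it is a classical instance of the family the paper's framework generalizes, not the proposed algorithm itself.

```python
# Minimal sketch: consensus ADMM for federated least squares (illustrative).
import numpy as np

rng = np.random.default_rng(0)
d, rho, n_clients = 5, 1.0, 4
As = [rng.normal(size=(100, d)) for _ in range(n_clients)]
bs = [A @ np.ones(d) + 0.1 * rng.normal(size=100) for A in As]

z = np.zeros(d)                               # server's consensus variable
u = [np.zeros(d) for _ in range(n_clients)]   # scaled dual variables

for _ in range(50):
    # Local step: client k minimizes its loss + (rho/2)||w - z + u_k||^2.
    ws = [np.linalg.solve(A.T @ A + rho * np.eye(d), A.T @ b + rho * (z - uk))
          for A, b, uk in zip(As, bs, u)]
    z = np.mean([w + uk for w, uk in zip(ws, u)], axis=0)   # server aggregation
    u = [uk + w - z for uk, w in zip(u, ws)]                # dual update
print(np.round(z, 3))  # close to the all-ones ground truth
```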

[471] RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching

Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda

Main category: cs.LG

TL;DR: Relevance Patching (RelP) is introduced as a more faithful and efficient alternative to attribution patching for mechanistic interpretability, using Layer-wise Relevance Propagation coefficients instead of gradients.

DetailsMotivation: Activation patching is computationally expensive at scale, while attribution patching suffers from noise and reduced reliability in deep non-linear networks.

Method: RelP replaces local gradients in attribution patching with propagation coefficients from Layer-wise Relevance Propagation (LRP), maintaining computational efficiency with only two forward passes and one backward pass.

Result: RelP more accurately approximates activation patching than standard attribution patching, achieving Pearson correlation of 0.956 vs 0.006 for MLP outputs in GPT-2 Large on IOI task, and achieves comparable faithfulness to Integrated Gradients without extra computational cost.

Conclusion: RelP provides a computationally efficient and faithful method for mechanistic interpretability that outperforms attribution patching and matches Integrated Gradients’ faithfulness with lower cost.

Abstract: Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network’s output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.
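
The estimate being modified is easy to state: attribution patching scores each activation by the (corrupt minus clean) activation difference times a coefficient, which is the local gradient in the standard method and an LRP propagation coefficient in RelP. A generic sketch of the scoring step, with tensors assumed to be cached from two forward passes and one backward pass:

```python
# Minimal sketch: the first-order patching estimate that RelP modifies.
import torch

def attribution_score(a_clean, a_corrupt, coeff):
    """First-order estimate of the metric change from patching one activation.

    Standard attribution patching passes the local gradient as `coeff`;
    RelP substitutes LRP propagation coefficients instead.
    """
    return ((a_corrupt - a_clean) * coeff).sum()

a_clean, a_corrupt = torch.randn(64), torch.randn(64)
grad = torch.randn(64)  # stands in for d(metric)/d(activation) from one backward pass
print(attribution_score(a_clean, a_corrupt, grad))
```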

[472] FEDONet : Fourier-Embedded DeepONet for Spectrally Accurate Operator Learning

Arth Sojitra, Mrigank Dhingra, Omer San

Main category: cs.LG

TL;DR: Fourier-embedded DeepONet (FEDONet) enhances traditional DeepONets by incorporating Fourier feature mappings in trunk networks, achieving 2-3× better accuracy in PDE solution learning.

DetailsMotivation: Standard DeepONets with fully connected linear layers struggle to capture complex spatial structures in PDEs, limiting their performance in operator learning tasks.

Method: Introduce Fourier-embedded trunk networks using random Fourier feature mappings to enrich spatial representation capabilities within the DeepONet architecture.

Result: FEDONet shows superior performance across multiple PDE datasets (Poisson, Burgers’, Lorenz-63, Eikonal, Allen-Cahn, Kuramoto-Sivashinsky) with 2-3× average relative L² performance gains over baseline DeepONet.

Conclusion: Fourier embeddings effectively enhance neural operator learning, providing a robust methodology for PDE surrogate modeling with broad applicability.

Abstract: Deep Operator Networks (DeepONets) have recently emerged as powerful data-driven frameworks for learning nonlinear operators, particularly suited for approximating solutions to partial differential equations. Despite their promising capabilities, the standard implementation of DeepONets, which typically employs fully connected linear layers in the trunk network, can encounter limitations in capturing complex spatial structures inherent to various PDEs. To address this limitation, we introduce Fourier-embedded trunk networks within the DeepONet architecture, leveraging random Fourier feature mappings to enrich spatial representation capabilities. Our proposed Fourier-embedded DeepONet (FEDONet) demonstrates superior performance compared to the traditional DeepONet across a comprehensive suite of PDE-driven datasets, including the two-dimensional Poisson, Burgers’, Lorenz-63, Eikonal, Allen-Cahn, and the Kuramoto-Sivashinsky equation. Empirical evaluations of FEDONet consistently show significant improvements in solution reconstruction accuracy, with average relative $L^2$ performance gains ranging between 2-3$\times$ compared to the DeepONet baseline. This study highlights the effectiveness of Fourier embeddings in enhancing neural operator learning, offering a robust and broadly applicable methodology for PDE surrogate modeling.
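
A minimal sketch of the random Fourier feature mapping that would feed the trunk network, with the frequency scale and feature width as illustrative hyperparameters rather than the paper's configuration.

```python
# Minimal sketch: random Fourier features for trunk-network inputs.
import numpy as np

rng = np.random.default_rng(0)
d_coord, n_feat, sigma = 2, 128, 10.0
B = rng.normal(scale=sigma, size=(d_coord, n_feat))  # fixed random frequencies

def fourier_embed(coords):
    """Map spatial coordinates (n, d_coord) to (n, 2*n_feat) Fourier features."""
    proj = 2.0 * np.pi * coords @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

coords = rng.uniform(size=(100, d_coord))  # query points for the trunk net
print(fourier_embed(coords).shape)         # (100, 256)
```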

[473] Adversarial generalization of unfolding (model-based) networks

Vicky Kouni

Main category: cs.LG

TL;DR: This paper provides the first theoretical analysis of adversarial generalization for unfolding networks, deriving tight error bounds and demonstrating that overparameterization can enhance robustness against adversarial attacks.

DetailsMotivation: Unfolding networks are used in critical applications like medical imaging and cryptography, but their adversarial robustness is not well understood theoretically. The paper aims to establish theoretical foundations for understanding how these networks perform under adversarial attacks.

Method: The authors study state-of-the-art overparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. They provide adversarial generalization error bounds for networks under l2-norm constrained attacks generated by the fast gradient sign method.

Result: The paper provides the first theoretical analysis on adversarial generalization of unfolding networks, with tight error bounds that respect the attack level. Experiments on real-world data consistently corroborate the derived theory across all datasets.

Conclusion: The study demonstrates that overparameterization in unfolding networks can be exploited to promote adversarial robustness, providing insights into how to efficiently robustify neural networks against adversarial attacks.

Abstract: Unfolding networks are interpretable networks emerging from iterative algorithms, incorporate prior knowledge of data structure, and are designed to solve inverse problems like compressed sensing, which deals with recovering data from noisy, missing observations. Compressed sensing finds applications in critical domains, from medical imaging to cryptography, where adversarial robustness is crucial to prevent catastrophic failures. However, a solid theoretical understanding of the performance of unfolding networks in the presence of adversarial attacks is still in its infancy. In this paper, we study the adversarial generalization of unfolding networks when perturbed with $l_2$-norm constrained attacks, generated by the fast gradient sign method. Particularly, we choose a family of state-of-the-art overparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. Given this estimate, we provide adversarial generalization error bounds for the networks under study, which are tight with respect to the attack level. To our knowledge, this is the first theoretical analysis on the adversarial generalization of unfolding networks. We further present a series of experiments on real-world data, with results corroborating our derived theory consistently across all datasets. Finally, we observe that the family's overparameterization can be exploited to promote adversarial robustness, shedding light on how to efficiently robustify neural networks.
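
A minimal sketch of an $l_2$-constrained fast-gradient-sign perturbation of the measurements fed to an unfolding network; rescaling the sign direction onto the $l_2$ ball is an assumption about how the norm constraint is enforced, not a detail taken from the paper.

```python
# Minimal sketch: FGSM direction rescaled to a fixed l2 attack budget.
import torch

def fgsm_l2(y, grad, eps):
    """Perturb measurements y along sign(grad), rescaled to l2 norm eps."""
    d = torch.sign(grad)
    return y + eps * d / d.norm(p=2).clamp_min(1e-12)

y = torch.randn(128)     # noisy compressed-sensing measurements
grad = torch.randn(128)  # stands in for the loss gradient w.r.t. y
y_adv = fgsm_l2(y, grad, eps=0.1)
print((y_adv - y).norm())  # ~0.1, i.e. exactly the attack level
```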

[474] PepCompass: Navigating peptide embedding spaces using Riemannian Geometry

Marcin Możejko, Adam Bielecki, Jurand Prądzyński, Marcin Traskowski, Antoni Janowski, Hyun-Su Lee, Marcelo Der Torossian Torres, Michał Kmicikiewicz, Paulina Szymczak, Karol Jurasz, Michał Kucharczyk, Cesar de la Fuente-Nunez, Ewa Szczurek

Main category: cs.LG

TL;DR: PepCompass is a geometry-aware framework for antimicrobial peptide discovery that uses Riemannian manifolds to better capture peptide space geometry, enabling more efficient exploration and optimization through local sampling methods and geodesic search.

DetailsMotivation: Current generative models for antimicrobial peptide discovery use flat Euclidean metrics that distort the true geometry of peptide space, making exploration and optimization inefficient. The astronomical size of peptide space and scarcity of active peptides further complicate discovery.

Method: PepCompass introduces Union of κ-Stable Riemannian Manifolds to capture local geometry. It uses Second-Order Riemannian Brownian Efficient Sampling for local exploration and Mutation Enumeration in Tangent Space for discrete amino-acid substitutions. These combine into Local Enumeration Bayesian Optimization (LE-BO) for local optimization, and Potential-minimizing Geodesic Search (PoGS) for property-enriched geodesic interpolation.

Result: In-vitro validation showed PoGS yielded four novel seed peptides, and subsequent LE-BO optimization discovered 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains.

Conclusion: Geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design, overcoming limitations of conventional flat Euclidean approaches and enabling more efficient discovery of active peptides.

Abstract: Antimicrobial peptide discovery is challenged by the astronomical size of peptide space and the relative scarcity of active peptides. Generative models provide continuous latent “maps” of peptide space, but conventionally ignore decoder-induced geometry and rely on flat Euclidean metrics, rendering exploration and optimization distorted and inefficient. Prior manifold-based remedies assume fixed intrinsic dimensionality, which critically fails in practice for peptide data. Here, we introduce PepCompass, a geometry-aware framework for peptide exploration and optimization. At its core, we define a Union of $\kappa$-Stable Riemannian Manifolds $\mathbb{M}^{\kappa}$, a family of decoder-induced manifolds that captures local geometry while ensuring computational stability. We propose two local exploration methods: Second-Order Riemannian Brownian Efficient Sampling, which provides a convergent second-order approximation to Riemannian Brownian motion, and Mutation Enumeration in Tangent Space, which reinterprets tangent directions as discrete amino-acid substitutions. Combining these yields Local Enumeration Bayesian Optimization (LE-BO), an efficient algorithm for local activity optimization. Finally, we introduce Potential-minimizing Geodesic Search (PoGS), which interpolates between prototype embeddings along property-enriched geodesics, biasing discovery toward seeds, i.e. peptides with favorable activity. In-vitro validation confirms the effectiveness of PepCompass: PoGS yields four novel seeds, and subsequent optimization with LE-BO discovers 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains. These results demonstrate that geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design.

[475] Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions

Jinhui Bai, Andreas Christmann, Lei Shi

Main category: cs.LG

TL;DR: A novel kernel SGD algorithm with improved efficiency and scalability through adaptive regularization and finite-dimensional projection, achieving minimax-optimal convergence rates and reduced computational complexity.

DetailsMotivation: To address the computational inefficiency and scalability limitations of traditional kernel SGD methods for large-scale supervised learning with general losses.

Method: Uses an innovative regularization strategy that projects stochastic gradients onto finite-dimensional hypothesis spaces via spherical radial basis function expansion, with adaptive scaling based on bias-variance trade-off. Incorporates coordinate-wise updates from linear SGD to reduce computational complexity.

Result: Proves both last iterate and suffix average converge at minimax-optimal rates, establishes optimal strong convergence in RKHS. Algorithm significantly reduces computational and storage complexity while maintaining performance across various loss functions.

Conclusion: The proposed kernel SGD algorithm provides an efficient and scalable solution for large-scale supervised learning, achieving optimal convergence rates with reduced computational burden, making it suitable for streaming data applications.

Abstract: In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.

[476] On the Impossibility of Retrain Equivalence in Machine Unlearning

Jiatong Yu, Yinghui He, Anirudh Goyal, Sanjeev Arora

Main category: cs.LG

TL;DR: Machine unlearning faces fundamental barriers in multi-stage training pipelines due to path dependence, making Retrain Equivalence impossible for path-oblivious algorithms.

DetailsMotivation: Modern ML pipelines use multi-stage training (e.g., LLM fine-tuning), but machine unlearning was formulated for i.i.d. data batches. This creates a gap between theory and practice.

Method: Theoretical analysis and empirical experiments on LLMs (Llama and Qwen, 1B-14B) using gradient ascent, NPO, and SimNPO unlearning algorithms across different training stage orderings.

Result: Unlearning outcomes are path-dependent - models fine-tuned via different orderings of identical stages diverge during unlearning, with GSM8K accuracy degradation varying by over 20% across paths. Some paths produce models that unlearn slowly.

Conclusion: Retrain Equivalence is ill-posed for local unlearning in multi-stage training. When training histories are unavailable, machine unlearning definitions and desiderata need rethinking.

Abstract: Machine unlearning seeks to selectively remove the “influence” of specific training data on a model’s outputs. The ideal goal is Retrain Equivalence–behavior identical to a model trained from scratch on only the retained data. This goal was formulated for models trained on i.i.d. data batches, but modern pipelines often involve multi-stage training, with each stage having a distinct data distribution and objective. Examples include LLM fine-tuning for alignment, reasoning ability, etc. Our study shows via theory and experiments that this shift to multi-stage training introduces a fundamental barrier for machine unlearning. The theory indicates that the outcome of local unlearning–methods that only use gradients computed on the forget set–is path-dependent. That is, a model’s behavior during unlearning is influenced by the order of its training stages during learning, making it impossible for path-oblivious algorithms to universally achieve Retrain Equivalence. We empirically demonstrate the same phenomenon in LLM post-training across Llama and Qwen models (1B to 14B) with gradient ascent, NPO, and SimNPO local unlearning algorithms. Models fine-tuned via different orderings of identical training stages diverge in behavior during unlearning, with the degradation in GSM8K accuracy after unlearning varying by over 20% across paths. We also observe that some learning paths consistently produce models that unlearn slowly. During unlearning, whether the probability mass gets squeezed into paraphrasing or alternative concepts is also path-dependent. These results consistently show that Retrain Equivalence is an ill-posed target for local unlearning algorithms, so long as the target models are trained in stages. In situations where access to models’ training histories is hard, the current work calls for rethinking the definition and desiderata of machine unlearning.

[477] Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset

Gereon Elvers, Gilad Landau, Oiwi Parker Jones

Main category: cs.LG

TL;DR: The paper proposes Keyword Spotting (KWS) as a practical intermediate task for non-invasive BCIs, using the LibriBrain corpus with standardized splits and robust evaluation metrics including AUPRC and FA/h.

DetailsMotivation: Current BCI benchmarks focus on simple tasks like Speech Detection, while application-ready tasks like Brain-to-Text remain challenging. KWS is proposed as a privacy-aware, practically applicable intermediate task.

Method: Used the 52-hour LibriBrain corpus with standardized train/validation/test splits. Developed a compact 1-D Conv/ResNet baseline with focal loss and top-k pooling, trainable on consumer GPUs. Employed AUPRC and FA/h metrics for evaluation.

Result: The reference model achieved approximately 13x the permutation baseline AUPRC on held-out sessions. Analysis showed log-linear performance scaling with training hours and identified word-level factors (frequency and duration) that affect detectability.

Conclusion: KWS is a viable intermediate task for BCIs, with predictable scaling and systematic factors influencing detectability. The authors release updated tools to support community research.

Abstract: Non-invasive brain-computer interfaces (BCIs) are beginning to benefit from large, public benchmarks. However, current benchmarks target relatively simple, foundational tasks like Speech Detection and Phoneme Classification, while application-ready results on tasks like Brain-to-Text remain elusive. We propose Keyword Spotting (KWS) as a practically applicable, privacy-aware intermediate task. Using the deep 52-hour, within-subject LibriBrain corpus, we provide standardized train/validation/test splits for reproducible benchmarking, and adopt an evaluation protocol tailored to extreme class imbalance. Concretely, we use area under the precision-recall curve (AUPRC) as a robust evaluation metric, complemented by false alarms per hour (FA/h) at fixed recall to capture user-facing trade-offs. To simplify deployment and further experimentation within the research community, we are releasing an updated version of the pnpl library with word-level dataloaders and Colab-ready tutorials. As an initial reference model, we present a compact 1-D Conv/ResNet baseline with focal loss and top-k pooling that is trainable on a single consumer-class GPU. The reference model achieves approximately 13x the permutation baseline AUPRC on held-out sessions, demonstrating the viability of the task. Exploratory analyses reveal: (i) predictable within-subject scaling - performance improves log-linearly with more training hours - and (ii) the existence of word-level factors (frequency and duration) that systematically modulate detectability.
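
A short sketch of the evaluation protocol described above, assuming scikit-learn and window-level labels; `eval_hours` and the helper name are illustrative. AUPRC summarizes ranking quality under extreme class imbalance, while FA/h reports false alarms at a fixed operating recall.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def kws_metrics(y_true, scores, eval_hours, target_recall=0.5):
    """AUPRC plus false alarms per hour (FA/h) at a fixed recall.

    y_true: binary keyword labels per window; scores: detector outputs;
    eval_hours: duration of the evaluated audio (assumed bookkeeping input).
    """
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    auprc = average_precision_score(y_true, scores)
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # recall is decreasing over increasing thresholds; pick the highest
    # threshold that still achieves the target recall.
    ok = np.where(recall[:-1] >= target_recall)[0]
    thr = thresholds[ok[-1]] if len(ok) else thresholds[0]
    false_alarms = np.sum((scores >= thr) & (y_true == 0))
    return auprc, false_alarms / eval_hours
```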

[478] Linearized Optimal Transport for Analysis of High-Dimensional Point-Cloud and Single-Cell Data

Tianxiang Wang, Yingtong Ke, Dhananjay Bhaskar, Smita Krishnaswamy, Alexander Cloninger

Main category: cs.LG

TL;DR: LOT framework embeds single-cell point clouds into Euclidean space for interpretable classification and synthetic data generation, bridging predictive performance with biological interpretability.

DetailsMotivation: Single-cell technologies produce irregular point clouds that are difficult to compare between patients, and existing nonlinear methods lack interpretability despite good predictive accuracy.

Method: Adapt Linear Optimal Transport (LOT) framework to embed irregular point clouds into fixed-dimensional Euclidean space while preserving distributional structure and optimal transport geometry.

Result: LOT enables accurate and interpretable classification of COVID-19 patient states with weights mapping to specific markers, and supports synthetic data generation for patient-derived organoids via barycenters.

Conclusion: LOT provides a unified framework that transforms heterogeneous point clouds into structured embeddings, enabling interpretable analysis of immune variation and treatment effects in biological systems.

Abstract: Single-cell technologies generate high-dimensional point clouds of cells, enabling detailed characterization of complex patient states and treatment responses. Yet each patient is represented by an irregular point cloud rather than a simple vector, making it difficult to directly quantify and compare biological differences between individuals. Nonlinear methods such as kernels and neural networks achieve predictive accuracy but act as black boxes, offering little biological interpretability. To address these limitations, we adapt the Linear Optimal Transport (LOT) framework to this setting, embedding irregular point clouds into a fixed-dimensional Euclidean space while preserving distributional structure. This embedding provides a principled linear representation that preserves optimal transport geometry while enabling downstream analysis. It also forms a registration between any two patients, enabling direct comparison of their cellular distributions. Within this space, LOT enables: (i) accurate and interpretable classification of COVID-19 patient states, where classifier weights map back to specific markers and spatial regions driving predictions; and (ii) synthetic data generation for patient-derived organoids, exploiting the linearity of the LOT embedding. LOT barycenters yield averaged cellular profiles representing combined conditions or samples, supporting drug interaction testing. Together, these results establish LOT as a unified framework that bridges predictive performance, interpretability, and generative modeling. By transforming heterogeneous point clouds into structured embeddings directly traceable to the original data, LOT opens new opportunities for understanding immune variation and treatment effects in high-dimensional biological systems.
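
A minimal sketch of a LOT embedding under common assumptions (uniform masses, a shared reference cloud, barycentric projection), using the POT library; the authors' exact construction may differ in details such as the choice of reference.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def lot_embed(clouds, reference):
    """Embed each point cloud as the barycentric image of a shared reference.

    reference: (k, d) template points; each cloud: (n_i, d). The flattened
    map images live in one fixed k*d Euclidean space, so clouds of different
    sizes become directly comparable vectors.
    """
    k = reference.shape[0]
    a = np.full(k, 1.0 / k)                       # uniform mass on the reference
    embeddings = []
    for X in clouds:
        b = np.full(len(X), 1.0 / len(X))         # uniform mass on the cloud
        M = ot.dist(reference, X)                  # squared Euclidean cost
        G = ot.emd(a, b, M)                        # optimal transport plan
        T = (G @ X) / a[:, None]                   # barycentric projection per ref point
        embeddings.append(T.reshape(-1))
    return np.vstack(embeddings)
```

Each patient's cloud becomes one fixed-length vector, so linear classifiers and barycenters (simple averages in the embedding space) apply directly.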

[479] Hierarchical Graph Networks for Accurate Weather Forecasting via Lightweight Training

Thomas Bailie, S. Karthik Mukkavilli, Varvara Vetrova, Yun Sing Koh

Main category: cs.LG

TL;DR: HiFlowCast and HiAntFlow are hierarchical graph neural networks that embed physics in multiscale weather prediction, preserving global trends and integrating PDE solutions across scales to improve accuracy and reliability.

DetailsMotivation: Accurate weather prediction is challenging due to multiscale physical processes that fixed-resolution methods cannot capture, and existing HGNNs often lose global trends during downward mappings.

Method: Introduces Latent-Memory-Retention mechanism to preserve global trends during downward traversal, and Latent-to-Physics branch to integrate PDE solution fields across diverse scales.

Result: Reduces errors by over 5% at 13-day lead times and 5-8% under extreme quantiles, with convergence within a single epoch using pretrained weights.

Conclusion: The approach improves weather forecasting reliability for rare events while reducing computational costs and environmental impact, addressing sustainability concerns in large-scale ML research.

Abstract: Climate events arise from intricate, multivariate dynamics governed by global-scale drivers, profoundly impacting food, energy, and infrastructure. Yet, accurate weather prediction remains elusive due to physical processes unfolding across diverse spatio-temporal scales, which fixed-resolution methods cannot capture. Hierarchical Graph Neural Networks (HGNNs) offer a multiscale representation, but nonlinear downward mappings often erase global trends, weakening the integration of physics into forecasts. We introduce HiFlowCast and its ensemble variant HiAntFlow, HGNNs that embed physics within a multiscale prediction framework. Two innovations underpin their design: a Latent-Memory-Retention mechanism that preserves global trends during downward traversal, and a Latent-to-Physics branch that integrates PDE solution fields across diverse scales. Our Flow models cut errors by over 5% at 13-day lead times and by 5-8% under 1st and 99th quantile extremes, improving reliability for rare events. Leveraging pretrained model weights, they converge within a single epoch, reducing training cost and their carbon footprint. Such efficiency is vital as the growing scale of machine learning challenges sustainability and limits research accessibility. Code and model weights are in the supplementary materials.
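
One plausible reading of the Latent-Memory-Retention idea, as a hedged sketch: a pooled global latent is carried down each coarse-to-fine mapping and re-injected residually, so the nonlinear downward step cannot erase the global trend. Module and dimension names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DownwardWithMemory(nn.Module):
    """One downward HGNN mapping that retains a global-trend memory.

    The coarse level is pooled into a memory vector concatenated onto every
    fine-level node before the nonlinear mapping, so the global signal
    survives the traversal (an illustrative reading of the paper).
    """
    def __init__(self, coarse_dim, fine_dim, hidden):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(fine_dim + coarse_dim, hidden), nn.GELU(),
            nn.Linear(hidden, fine_dim),
        )

    def forward(self, coarse_nodes, fine_nodes):
        memory = coarse_nodes.mean(dim=0)                # pooled global trend
        mem = memory.expand(fine_nodes.shape[0], -1)     # broadcast to fine nodes
        return fine_nodes + self.mlp(torch.cat([fine_nodes, mem], dim=-1))
```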

[480] Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection

Akira Tamamori

Main category: cs.LG

TL;DR: Two-Stage LKPLO is a multi-stage outlier detection framework that addresses limitations of conventional methods by combining adaptive loss functions, kernel PCA for non-linear data, and local clustering for multi-modal distributions, achieving state-of-the-art performance.

DetailsMotivation: To overcome the coexisting limitations of conventional projection-based outlier detection methods: their reliance on fixed statistical metrics and assumption of single data structures.

Method: A two-stage framework that synthesizes: (1) generalized loss-based outlyingness measure (PLO) with flexible loss functions like SVM-like loss, (2) global kernel PCA stage to linearize non-linear data structures, and (3) local clustering stage to handle multi-modal distributions.

Result: Achieved state-of-the-art performance in comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, significantly outperforming baselines on challenging datasets like Optdigits (multi-cluster) and Arrhythmia (high-dimensional). Ablation study confirmed the necessity of both kernelization and localization stages.

Conclusion: The work contributes a powerful new tool for outlier detection problems and demonstrates the importance of hybrid, multi-stage architectures for handling complex data structures.

Abstract: This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.
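
A compact sketch of the two-stage pipeline under stated simplifications: kernel PCA to linearize, k-means to localize, then projection outlyingness within each cluster. Here the outlyingness uses a plain median/MAD robust score over random directions, where the paper's PLO substitutes adaptive losses such as its SVM-like loss.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

def two_stage_plo(X, n_components=8, n_clusters=3, n_proj=50, seed=0):
    """Kernel PCA -> local clusters -> projection outlyingness per cluster."""
    rng = np.random.default_rng(seed)
    Z = KernelPCA(n_components=n_components, kernel="rbf").fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    scores = np.zeros(len(Z))
    for c in np.unique(labels):
        Zc = Z[labels == c]
        U = rng.normal(size=(n_proj, n_components))
        U /= np.linalg.norm(U, axis=1, keepdims=True)    # unit projection directions
        P = Zc @ U.T                                      # (n_c, n_proj) projections
        med = np.median(P, axis=0)
        mad = np.median(np.abs(P - med), axis=0) + 1e-12
        out = np.abs(P - med) / mad                       # robust score per direction
        scores[labels == c] = out.max(axis=1)             # worst case over projections
    return scores
```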

[481] Send Less, Save More: Energy-Efficiency Benchmark of Embedded CNN Inference vs. Data Transmission in IoT

Benjamin Karic, Nina Herrmann, Jan Stenkamp, Paula Scharf, Fabian Gieseke, Angela Schwering

Main category: cs.LG

TL;DR: This paper evaluates using compressed CNNs on ESP32-S3 microcontrollers with Low Power Wide Area Networks for environmental monitoring, showing that on-device inference reduces energy consumption by up to 5x compared to transmitting raw image data.

DetailsMotivation: Environmental challenges require effective remote monitoring solutions, but designing energy-efficient IoT devices for long-term operation in remote areas with limited power is challenging, especially for image-based monitoring applications.

Method: The researchers used compressed Convolutional Neural Networks trained on domain-specific datasets running on ESP32-S3 microcontrollers, combined with Low Power Wide Area Networks, performing inference directly on the devices to reduce data transmission.

Result: Executing CNN inference on-device and transmitting only the results reduces overall energy consumption by a factor of up to five compared to sending raw image data.

Conclusion: These findings support developing IoT applications with reduced carbon footprint that can operate autonomously in environmental monitoring scenarios by incorporating Embedded Machine Learning.

Abstract: The integration of the Internet of Things (IoT) and Artificial Intelligence offers significant opportunities to enhance our ability to monitor and address ecological changes. As environmental challenges become increasingly pressing, the need for effective remote monitoring solutions is more critical than ever. A major challenge in designing IoT applications for environmental monitoring - particularly those involving image data - is to create energy-efficient IoT devices capable of long-term operation in remote areas with limited power availability. Advancements in the field of Tiny Machine Learning allow the use of Convolutional Neural Networks (CNNs) on resource-constrained, battery-operated microcontrollers. Since data transfer is energy-intensive, performing inference directly on microcontrollers to reduce the message size can extend the operational lifespan of IoT nodes. This work evaluates the use of common Low Power Wide Area Networks and compressed CNNs trained on domain-specific datasets on an ESP32-S3. Our experiments demonstrate, among other things, that executing CNN inference on-device and transmitting only the results reduces the overall energy consumption by a factor of up to five compared to sending raw image data. These findings advocate the development of IoT applications with a reduced carbon footprint that are capable of operating autonomously in environmental monitoring scenarios by incorporating Embedded Machine Learning.
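
The trade-off reduces to simple energy bookkeeping: energy is power draw times active time for each phase. A back-of-envelope sketch follows; every constant is an assumed placeholder, not a measurement from the paper.

```python
# Illustrative energy comparison (assumed numbers, not the paper's measurements):
# energy = power draw x active time for each phase.
RADIO_POWER_W   = 0.40    # assumed LoRa-class transmit power
CPU_POWER_W     = 0.25    # assumed ESP32-S3 active power during inference

RAW_IMAGE_BYTES = 20_000
RESULT_BYTES    = 16
LINK_BPS        = 20_000  # assumed effective LPWAN throughput
INFER_SECONDS   = 2.0     # assumed on-device CNN latency

def tx_energy(n_bytes):
    return RADIO_POWER_W * (8 * n_bytes / LINK_BPS)

send_raw = tx_energy(RAW_IMAGE_BYTES)                       # transmit raw image
infer_tx = CPU_POWER_W * INFER_SECONDS + tx_energy(RESULT_BYTES)  # infer, send result
print(f"raw image: {send_raw:.2f} J, on-device inference: {infer_tx:.2f} J, "
      f"ratio {send_raw / infer_tx:.1f}x")
```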

[482] Adaptive EEG-based stroke diagnosis with a GRU-TCN classifier and deep Q-learning thresholding

Shakeel Abdulkareem, Bora Yimenicioglu, Khartik Uppalapati, Aneesh Gudipati, Adan Eftekhari, Saleh Yassin

Main category: cs.LG

TL;DR: An adaptive multitask EEG classifier using GRU-TCN and deep Q-network achieves high accuracy (98%) for stroke type classification, severity, and lateralization, with real-time threshold adaptation.

DetailsMotivation: Rapid triage of suspected stroke requires accurate bedside tools; EEG is promising but underused at first contact.

Method: Converts 32-channel EEG signals to power spectral density features, uses GRU-TCN network to predict stroke type, lateralization, and severity, and applies DQN for real-time threshold tuning.

Result: Baseline GRU-TCN achieved 89.3% accuracy for stroke type, 96.9% for severity, and 96.7% for lateralization. With DQN adaptation, stroke-type accuracy increased to 98.0%. Robustness confirmed on independent cohort.

Conclusion: Adaptive thresholding enables clinically preferred sensitivity-specificity trade-offs, with interpretable visualizations supporting deployment.

Abstract: Rapid triage of suspected stroke needs accurate, bedside-deployable tools; EEG is promising but underused at first contact. We present an adaptive multitask EEG classifier that converts 32-channel signals to power spectral density features (Welch), uses a recurrent-convolutional network (GRU-TCN) to predict stroke type (healthy, ischemic, hemorrhagic), hemispheric lateralization, and severity, and applies a deep Q-network (DQN) to tune decision thresholds in real time. Using a patient-wise split of the UCLH Stroke EIT/EEG data set (44 recordings; about 26 acute stroke, 10 controls), the primary outcome was stroke-type performance; secondary outcomes were severity and lateralization. The baseline GRU-TCN reached 89.3% accuracy (F1 92.8%) for stroke type, about 96.9% (F1 95.9%) for severity, and about 96.7% (F1 97.4%) for lateralization. With DQN threshold adaptation, stroke-type accuracy increased to about 98.0% (F1 97.7%). We also tested robustness on an independent, low-density EEG cohort (ZJU4H) and report paired patient-level statistics. Analyses follow STARD 2015 guidance for diagnostic accuracy studies (index test: GRU-TCN+DQN; reference standard: radiology/clinical diagnosis; patient-wise evaluation). Adaptive thresholding shifts the operating point to clinically preferred sensitivity-specificity trade-offs, while integrated scalp-map and spectral visualizations support interpretability.
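
A sketch of the feature step, assuming SciPy and standard EEG band edges (delta/theta/alpha/beta), which the summary does not specify; the GRU-TCN then consumes these per-channel band powers.

```python
import numpy as np
from scipy.signal import welch

def eeg_psd_features(eeg, fs=256, bands=((1, 4), (4, 8), (8, 13), (13, 30))):
    """Welch power spectral density band features per channel.

    eeg: (n_channels, n_samples). Band edges are standard delta/theta/
    alpha/beta choices assumed here, not taken from the paper.
    """
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2, axis=-1)
    feats = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, mask].mean(axis=-1))      # mean band power per channel
    return np.log(np.stack(feats, axis=-1) + 1e-12)   # (n_channels, n_bands)
```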

cs.MA

[483] Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets

Gagan Bansal, Wenyue Hua, Zezhou Huang, Adam Fourney, Amanda Swearngin, Will Epperson, Tyler Payne, Jake M. Hofman, Brendan Lucier, Chinmay Singh, Markus Mobius, Akshay Nambi, Archana Yadav, Kevin Gao, David M. Rothschild, Aleksandrs Slivkins, Daniel G. Goldstein, Hussein Mozannar, Nicole Immorlica, Maya Murad, Matthew Vogel, Subbarao Kambhampati, Eric Horvitz, Saleema Amershi

Main category: cs.MA

TL;DR: This paper studies LLM agents in realistic two-sided marketplaces, revealing performance degrades with scale and severe first-proposal bias favoring speed over quality.

DetailsMotivation: As LLM agents increasingly mediate economic decisions, there's a need to understand their behavior in realistic market conditions beyond constrained settings.

Method: Developed Magentic-Marketplace - a simulated environment with Assistant agents (consumers) and Service agents (businesses) to study market dynamics, biases, and search mechanisms.

Result: Frontier models approach optimal welfare under ideal search conditions but performance degrades sharply with scale. All models show severe first-proposal bias creating 10-30x advantages for response speed over quality.

Conclusion: Findings reveal how agent behaviors emerge across market conditions, informing the design of fair and efficient agentic marketplaces.

Abstract: As LLM agents advance, they are increasingly mediating economic decisions, ranging from product discovery to transactions, on behalf of users. Such applications promise benefits but also raise many questions about agent accountability and value for users. Addressing these questions requires understanding how agents behave in realistic market conditions. However, previous research has largely evaluated agents in constrained settings, such as single-task marketplaces (e.g., negotiation) or structured two-agent interactions. Real-world markets are fundamentally different: they require agents to handle diverse economic activities and coordinate within large, dynamic ecosystems where multiple agents with opaque behaviors may engage in open-ended dialogues. To bridge this gap, we investigate two-sided agentic marketplaces where Assistant agents represent consumers and Service agents represent competing businesses. To study these interactions safely, we develop Magentic-Marketplace – a simulated environment where Assistants and Services can operate. This environment enables us to study key market dynamics: the utility agents achieve, behavioral biases, vulnerability to manipulation, and how search mechanisms shape market outcomes. Our experiments show that frontier models can approach optimal welfare – but only under ideal search conditions. Performance degrades sharply with scale, and all models exhibit severe first-proposal bias, creating 10-30x advantages for response speed over quality. These findings reveal how behaviors emerge across market conditions, informing the design of fair and efficient agentic marketplaces.

[484] Multi-Agent Reinforcement Learning for Market Making: Competition without Collusion

Ziyi Wang, Carmine Ventre, Maria Polukarov

Main category: cs.MA

TL;DR: A hierarchical multi-agent RL framework studies algorithmic collusion in market making, showing adaptive agents can achieve dominant performance while maintaining more sustainable market coexistence than purely competitive agents.

DetailsMotivation: To understand emergent behavior in AI-driven markets, particularly algorithmic collusion and how different agent objectives affect market outcomes and efficiency.

Method: Hierarchical multi-agent RL framework with self-interested market maker (Agent A) trained against adversary, plus three competitor agents: profit-maximizing (B1), competitive (B2), and hybrid adaptive (B*) agents.

Result: Agent B2 dominates in zero-sum settings but harms other agents’ rewards, while hybrid Agent B* achieves dominant market share with milder adverse impact on others, supporting more sustainable coexistence.

Conclusion: Adaptive incentive control enables more sustainable strategic coexistence in heterogeneous agent environments and provides structured evaluation for algorithmic trading system design.

Abstract: Algorithmic collusion has emerged as a central question in AI: Will the interaction between different AI agents deployed in markets lead to collusion? More generally, understanding how emergent behavior, be it a cartel or market dominance from more advanced bots, affects the market overall is an important research question. We propose a hierarchical multi-agent reinforcement learning framework to study algorithmic collusion in market making. The framework includes a self-interested market maker (Agent A), which is trained in an uncertain environment shaped by an adversary, and three bottom-layer competitors: the self-interested Agent B1 (whose objective is to maximize its own PnL), the competitive Agent B2 (whose objective is to minimize the PnL of its opponent), and the hybrid Agent B*, which can modulate between the behavior of the other two. To analyze how these agents shape the behavior of each other and affect market outcomes, we propose interaction-level metrics that quantify behavioral asymmetry and system-level dynamics, while providing signals potentially indicative of emergent interaction patterns. Experimental results show that Agent B2 secures dominant performance in a zero-sum setting against B1, aggressively capturing order flow while tightening average spreads, thus improving market execution efficiency. In contrast, Agent B* exhibits a self-interested inclination when co-existing with other profit-seeking agents, securing dominant market share through adaptive quoting, yet exerting a milder adverse impact on the rewards of Agents A and B1 compared to B2. These findings suggest that adaptive incentive control supports more sustainable strategic co-existence in heterogeneous agent environments and offers a structured lens for evaluating behavioral design in algorithmic trading systems.
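
One plausible reading of the hybrid agent's modulation, as a one-line objective; the paper's exact rule may differ.

```python
def hybrid_reward(own_pnl, opponent_pnl, alpha):
    """Illustrative hybrid objective for Agent B*.

    alpha = 1 recovers the self-interested Agent B1 (maximize own PnL);
    alpha = 0 recovers the competitive Agent B2 (minimize the opponent's).
    """
    return alpha * own_pnl - (1.0 - alpha) * opponent_pnl
```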

[485] Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems

Fulin Lin, Shaowen Chen, Ruishan Fang, Hongwei Wang, Tao Lin

Main category: cs.MA

TL;DR: SupervisorAgent is a lightweight framework for runtime supervision of multi-agent systems that reduces token consumption by 29.45% on GAIA benchmark while maintaining success rates.

DetailsMotivation: Multi-agent systems face critical inefficiencies like excessive token consumption and misinformation failures, with existing methods focusing only on post-hoc failure attribution rather than proactive interventions.

Method: A lightweight, modular framework with LLM-free adaptive filter that intervenes at critical junctures to correct errors, guide inefficient behaviors, and purify observations without altering base agent architecture.

Result: Reduced token consumption by 29.45% on GAIA benchmark without compromising success rate; validated across five additional benchmarks (math reasoning, code generation, question answering) with various state-of-the-art foundation models.

Conclusion: SupervisorAgent provides broad applicability and robustness for enhancing efficiency and reliability in multi-agent systems through runtime adaptive supervision.

Abstract: While Multi-Agent Systems (MAS) excel at complex tasks, their growing autonomy with operational complexity often leads to critical inefficiencies, such as excessive token consumption and failures arising from misinformation. Existing methods primarily focus on post-hoc failure attribution, lacking proactive, real-time interventions to enhance robustness and efficiency. To this end, we introduce SupervisorAgent, a lightweight and modular framework for runtime, adaptive supervision that operates without altering the base agent’s architecture. Triggered by an LLM-free adaptive filter, SupervisorAgent intervenes at critical junctures to proactively correct errors, guide inefficient behaviors, and purify observations. On the challenging GAIA benchmark, SupervisorAgent reduces the token consumption of the Smolagent framework by an average of 29.45% without compromising its success rate. Extensive experiments across five additional benchmarks (math reasoning, code generation, and question answering) and various SoTA foundation models validate the broad applicability and robustness of our approach. The code is available at https://github.com/LINs-lab/SupervisorAgent.
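
The adaptive filter is described as LLM-free, i.e., cheap heuristics decide when the supervisor steps in. A hedged sketch with invented trigger rules (token budget, repeated tool failures, action loops); the real framework's trigger features are not specified in the summary.

```python
def should_intervene(step_log, token_budget=20_000):
    """LLM-free adaptive filter: cheap checks that flag critical junctures.

    step_log is an assumed list of dicts with 'tokens', 'tool_error', and
    'action' keys; all thresholds here are illustrative.
    """
    total_tokens = sum(s["tokens"] for s in step_log)
    if total_tokens > token_budget:                            # runaway consumption
        return True
    if sum(1 for s in step_log[-3:] if s["tool_error"]) >= 2:  # repeated failures
        return True
    recent = [s["action"] for s in step_log[-4:]]
    return len(recent) == 4 and len(set(recent)) == 1          # looping on one action
```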

[486] A General Incentives-Based Framework for Fairness in Multi-agent Resource Allocation

Ashwin Kumar, William Yeoh

Main category: cs.MA

TL;DR: GIFF is a fairness framework for multi-agent resource allocation that uses Q-functions to balance efficiency and fairness without extra training, achieving equitable outcomes across various domains.

DetailsMotivation: Address the problem where agents optimizing for efficiency in resource-constrained settings create inequitable outcomes, requiring a method to balance efficiency with fairness.

Method: Uses action-value functions to compute local fairness gains and counterfactual advantage corrections to prevent over-allocation to well-off agents, implemented in centralized control settings.

Result: Outperforms baselines in domains like ridesharing, homelessness prevention, and job allocation, discovering far-sighted equitable policies with theoretical guarantees.

Conclusion: GIFF provides a robust, principled framework using standard RL components to achieve equitable outcomes in complex multi-agent systems.

Abstract: We introduce the General Incentives-based Framework for Fairness (GIFF), a novel approach for fair multi-agent resource allocation that infers fair decision-making from standard value functions. In resource-constrained settings, agents optimizing for efficiency often create inequitable outcomes. Our approach leverages the action-value (Q-)function to balance efficiency and fairness without requiring additional training. Specifically, our method computes a local fairness gain for each action and introduces a counterfactual advantage correction term to discourage over-allocation to already well-off agents. This approach is formalized within a centralized control setting, where an arbitrator uses the GIFF-modified Q-values to solve an allocation problem. Empirical evaluations across diverse domains, including dynamic ridesharing, homelessness prevention, and a complex job allocation task, demonstrate that our framework consistently outperforms strong baselines and can discover far-sighted, equitable policies. The framework’s effectiveness is supported by a theoretical foundation; we prove its fairness surrogate is a principled lower bound on the true fairness improvement and that its trade-off parameter offers monotonic tuning. Our findings establish GIFF as a robust and principled framework for leveraging standard reinforcement learning components to achieve more equitable outcomes in complex multi-agent systems.
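
An illustrative functional form for the GIFF-modified Q-values described above; the fairness gain, utility inputs, and the shape of the counterfactual correction are assumptions, with only the additive structure taken from the summary.

```python
import numpy as np

def giff_q(q_values, fairness_gain, agent_utility, lam=1.0, beta=1.0):
    """GIFF-style modified Q-values (an assumed functional form).

    q_values: (n_actions,) task Q-values; fairness_gain: (n_actions,) local
    fairness improvement of each allocation; agent_utility: (n_actions,)
    current utility of the agent each action would serve. The correction
    down-weights actions favoring already well-off agents.
    """
    advantage_correction = beta * (agent_utility - agent_utility.mean())
    return q_values + lam * fairness_gain - advantage_correction

# The arbitrator then allocates greedily on the modified values:
# chosen = int(np.argmax(giff_q(q, g, u)))
```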

cs.MM

[487] Contribution-Guided Asymmetric Learning for Robust Multimodal Fusion under Imbalance and Noise

Zijing Xu, Yunfeng Kou, Kunming Wu, Hong Liu

Main category: cs.MM

TL;DR: CAL proposes a novel modality compression paradigm that enhances high-contribution modalities while compressing weak modalities to improve multimodal fusion performance, addressing modality imbalance and data noise challenges.

DetailsMotivation: Existing multimodal learning methods suppress dominant modalities but ignore inherent differences in modality information value, leading to suboptimal solutions. The paper aims to address modality imbalance and data noise for better robustness and generalization.

Method: CAL uses a modality contribution metric W^m combining information quantity I(m) and confidence D(m), with asymmetric gradient acceleration and contribution-aware Asymmetric Information Bottleneck (AIB) compression mechanisms.

Result: CAL achieves state-of-the-art performance on five benchmark datasets: 79.30% on CREMA-D, 74.82% on KS, and 74.21% on AVE, significantly outperforming ARL. It also shows leading performance in high-noise robustness tests on MVSA-Single and NYUD2 datasets.

Conclusion: CAL demonstrates significant advantages in handling modality imbalance and noise interference, offering a flexible and efficient framework with broad adaptability and application potential for multimodal learning tasks.

Abstract: Multimodal learning faces two major challenges: modality imbalance and data noise, which significantly affect the robustness and generalization ability of models. Existing methods achieve modality balance by suppressing dominant modalities, but they neglect the inherent differences in the information value between modalities, potentially leading to convergence to suboptimal solutions. This paper proposes an innovative modality compression paradigm, Contribution-Guided Asymmetric Learning (CAL), which aims to enhance the contribution of high-contribution modalities while compressing weak modalities to increase their contribution, allowing both to improve the performance of multimodal information fusion. CAL is based on a modality contribution metric W^m combining the information quantity I(m) and confidence D(m), and it designs an asymmetric gradient acceleration mechanism and a contribution-aware Asymmetric Information Bottleneck (AIB) compression mechanism. The former accelerates the gradient update of modalities, while the latter dynamically compresses the noise of low-contribution modalities. On five benchmark datasets, including emotion recognition, scene recognition, and event localization tasks, CAL has shown outstanding performance in imbalanced fusion tasks and noise robustness tests. On CREMA-D, KS, and AVE, CAL achieves 79.30%, 74.82%, and 74.21% accuracy, significantly outperforming the existing state-of-the-art model ARL. In high-noise robustness tests, CAL also achieved leading performance under various attack strategies on the MVSA-Single and NYUD2 datasets. These results validate the significant advantages of CAL in modality imbalance and noise interference. CAL, as a flexible and efficient framework, is easy to transfer to other tasks and has broad adaptability and potential application prospects.
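
A hedged sketch of the contribution metric and asymmetric gradient scaling; the product combination of I(m) and D(m) and the step-function scaling rule are assumptions standing in for the paper's definitions.

```python
def contribution_weights(info, conf):
    """W^m from information quantity I(m) and confidence D(m).

    The product is an assumed combination rule; the paper defines its own.
    """
    w = {m: info[m] * conf[m] for m in info}
    total = sum(w.values())
    return {m: v / total for m, v in w.items()}

def gradient_scale(w_m, mean_w, gamma=0.5):
    """Asymmetric acceleration: boost high-contribution modalities, damp
    low-contribution ones before compression (illustrative rule)."""
    return 1.0 + gamma if w_m >= mean_w else 1.0 - gamma
```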

[488] Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

Ali Vosoughi, Dimitra Emmanouilidou, Hannes Gamper

Main category: cs.MM

TL;DR: AVVA framework improves audio-video alignment using LLM-based data curation and contrastive learning, achieving better retrieval performance with less training data.

DetailsMotivation: Current multimodal foundational models struggle with effective audio-visual integration beyond simple temporal synchronization.

Method: Uses Whisper for audio and DINOv2 for video in dual-encoder structure with contrastive learning, plus LLM-based data curation with scoring mechanism.

Result: Significant improvement in top-k accuracies for video-to-audio retrieval on AudioCaps, VALOR, and VGGSound using only 192 hours of curated data.

Conclusion: Data curation effectively trades data quality for quantity, yielding better performance than training on uncurated data.

Abstract: Integrating audio and visual data for training multimodal foundational models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hrs of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quality for data quantity, yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound, compared to training on the full spectrum of uncurated data.
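
The dual-encoder contrastive objective is the standard symmetric InfoNCE over aligned pairs; a minimal PyTorch sketch, assuming pre-pooled Whisper and DINOv2 embeddings.

```python
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of aligned audio-video pairs.

    audio_emb, video_emb: (B, D) pooled outputs of the two encoders
    (e.g., Whisper for audio, DINOv2 for video). Matching indices are
    positives; everything else in the batch is a negative.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(len(a), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```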

[489] PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models

Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Chengfei lv, Shengyu Zhang

Main category: cs.MM

TL;DR: PureKV is a plug-and-play framework that jointly optimizes sparse attention and KV cache compression for Vision-Language Large Models, achieving 5x KV cache compression and 3.16x prefill acceleration with minimal quality loss.

DetailsMotivation: VLLMs face efficiency challenges from quadratic attention complexity and growing KV cache size during prefilling and decoding. Existing KV cache compression methods are incompatible with efficient attention mechanisms like FlashAttention and don't account for how sparse attention alters KV cache structure.

Method: Proposes PureKV with two components: 1) KV cache compression using lower layer attention scores to estimate importance of high layers’ KV cache, enabling active pruning without accuracy loss, and 2) Spatial-Temporal Sparse Attention (ST-SpAttn) module that combines spatial and temporal attention sparsity to purify noise and redundancy in KV cache.

Result: Extensive experiments on VideoLLaMA2 and Qwen2.5-VL show PureKV achieves 5.0x KV cache compression and 3.16x prefill acceleration with negligible quality degradation.

Conclusion: PureKV effectively addresses efficiency challenges in VLLMs by providing a joint optimization framework for sparse attention and KV cache compression that is compatible with efficient attention mechanisms and maintains model quality.

Abstract: Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity in attention and autoregressive generation, as well as the constantly growing key value (KV) cache size, severely hinder the prefilling and decoding stages. Recent efforts have attempted to compress the KV cache by identifying and pruning the KV cache of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention, which do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream KV cache compression strategies. To address this issue, we propose PureKV, a plug-and-play framework for joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators. Our method utilizes lower layer attention scores to estimate the importance of high layers’ KV cache, enabling active pruning without compromising accuracy. In addition, we have designed a Spatial-Temporal Sparse Attention (ST-SpAttn) module specifically tailored for video KV cache compression algorithms. This module combines spatial and temporal attention sparsity to improve the compression efficiency of KV cache optimization algorithms by purifying spatial noise and temporal redundancy in KV cache. At the same time, ST-SpAttn also accelerates the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) have shown that PureKV achieves 5.0 times KV cache compression and 3.16 times prefill acceleration, with negligible quality degradation.
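
A sketch of the cross-layer pruning idea: importance scores aggregated from a lower layer select which KV entries a higher layer keeps, so the current layer's attention matrix is never needed. Shapes and the keep-ratio policy are assumptions.

```python
import torch

def prune_kv_cache(keys, values, lower_layer_scores, keep_ratio=0.2):
    """Prune a layer's KV cache using attention scores from a lower layer.

    keys/values: (B, H, T, Dh); lower_layer_scores: (B, T) per-token
    importance aggregated from a lower layer's attention (assumed shapes).
    This avoids materializing the current layer's attention matrix, which
    efficient kernels like FlashAttention never compute explicitly.
    """
    B, H, T, Dh = keys.shape
    k = max(1, int(keep_ratio * T))
    idx = lower_layer_scores.topk(k, dim=-1).indices.sort(dim=-1).values  # (B, k)
    gather = idx[:, None, :, None].expand(B, H, k, Dh)   # broadcast to heads/dims
    return keys.gather(2, gather), values.gather(2, gather)
```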

eess.AS

[490] SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland

Main category: eess.AS

TL;DR: SPEAR is the first SSL framework that learns unified speech and audio representations from mixed data using masked prediction of fine-grained discrete tokens derived via Multi-codebook Vector Quantisation.

DetailsMotivation: Current SSL methods are domain-specific (speech or audio only), hindering development of unified representation models with comprehensive capabilities across both domains.

Method: Proposes unified pre-training objective based on masked prediction of fine-grained discrete tokens for both speech and audio, using Multi-codebook Vector Quantisation (MVQ) to retain rich acoustic detail.

Result: Speech-domain model achieves new SOTA on SUPERB benchmark, matching/surpassing WavLM Large on 12/15 tasks. Unified model shows comprehensive capabilities across SUPERB and HEAR benchmarks. Scaled 600M parameter model excels in both domains.

Conclusion: SPEAR successfully learns unified speech and audio representations, establishing powerful and versatile SSL models for auditory understanding that outperform domain-specific approaches.

Abstract: Self-Supervised Learning (SSL) excels at learning generic representations of acoustic signals, yet prevailing methods remain domain-specific, tailored to either speech or general audio, hindering the development of a unified representation model with a comprehensive capability over both domains. To address this, we present SPEAR (SPEech and Audio Representations), the first SSL framework to successfully learn unified speech and audio representations from a mixture of speech and audio data. SPEAR proposes a unified pre-training objective based on masked prediction of fine-grained discrete tokens for both speech and general audio. These tokens are derived from continuous speech and audio representations using a Multi-codebook Vector Quantisation (MVQ) method, retaining rich acoustic detail essential for modelling both speech and complex audio events. SPEAR is applied to pre-train both single-domain and unified speech-and-audio SSL models. Our speech-domain model establishes a new state-of-the-art on the SUPERB benchmark, a speech processing benchmark for SSL models, matching or surpassing the highly competitive WavLM Large on 12 out of 15 tasks with the same pre-training corpora and a similar model size. Crucially, our unified model learns complementary features and demonstrates comprehensive capabilities across two major benchmarks, SUPERB and HEAR, for evaluating audio representations. By further scaling up the model size and pre-training data, we present a unified model with 600M parameters that excels in both domains, establishing it as one of the most powerful and versatile open-source SSL models for auditory understanding. The inference code and pre-trained models will be made publicly available.
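
A minimal sketch of multi-codebook vector quantisation as the summary describes it: K independent codebooks each emit one token per frame, so the K parallel token streams jointly retain more acoustic detail than any single codebook would. Sizes are illustrative.

```python
import torch

def mvq_tokens(features, codebooks):
    """Multi-codebook VQ: one discrete token per codebook per frame.

    features: (T, D); codebooks: list of K tensors of shape (V, D).
    Each codebook quantizes the full feature independently.
    """
    tokens = []
    for cb in codebooks:                       # cb: (V, D)
        d = torch.cdist(features, cb)          # (T, V) pairwise distances
        tokens.append(d.argmin(dim=-1))        # nearest code per frame
    return torch.stack(tokens, dim=-1)         # (T, K) token ids

# Example with assumed sizes: 8 codebooks of 256 codes over 768-dim features.
# codebooks = [torch.randn(256, 768) for _ in range(8)]
```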

[491] Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction

Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong

Main category: eess.AS

TL;DR: Phoenix-VAD is an LLM-based streaming semantic endpoint detection model that enables plug-and-play full-duplex prediction for seamless audio interactions in spoken dialogue systems.

DetailsMotivation: Current spoken dialogue models lack a plug-and-play full-duplex prediction module for semantic endpoint detection, which hinders seamless audio interactions in human-computer interaction systems.

Method: Leverages LLM’s semantic comprehension capability with a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference.

Result: Achieves excellent and competitive performance on both semantically complete and incomplete speech scenarios in experiments.

Conclusion: The design enables independent optimization of the full-duplex prediction module, providing more reliable and flexible support for next-generation human-computer interaction.

Abstract: Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.

[492] DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching

Yuepeng Jiang, Huakang Chen, Ziqian Ning, Jixun Yao, Zerui Han, Di Wu, Meng Meng, Jian Luan, Zhonghua Fu, Lei Xie

Main category: eess.AS

TL;DR: DiffRhythm 2 is an end-to-end framework for high-fidelity, controllable song generation that addresses lyric-vocal alignment and multi-preference optimization challenges using semi-autoregressive architecture and cross-pair preference optimization.

DetailsMotivation: Existing non-autoregressive frameworks struggle with lyric-vocal alignment and reinforcement learning from human feedback often causes performance degradation when merging models for multi-preference optimization.

Method: Uses semi-autoregressive architecture based on block flow matching for lyric alignment, music VAE for low frame rate processing, cross-pair preference optimization for RLHF, and stochastic block representation alignment loss for musical coherence.

Result: The framework achieves faithful lyric-vocal alignment without external constraints while maintaining generation quality and efficiency, with computationally tractable processing of long sequences at 5 Hz frame rate.

Conclusion: DiffRhythm 2 provides an effective solution for high-quality song generation with improved lyric alignment and robust multi-preference optimization, overcoming key limitations of existing methods.

Abstract: Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocal. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels and constraints, all while preserving the high generation quality and efficiency of NAR models. To make this framework computationally tractable for long sequences, we implement a music variational autoencoder (VAE) that achieves a low frame rate of 5 Hz while still enabling high-fidelity audio reconstruction. In addition, to overcome the limitations of multi-preference optimization in RLHF, we propose cross-pair preference optimization. This method effectively mitigates the performance drop typically associated with model merging, allowing for more robust optimization across diverse human preferences. We further enhance musicality and structural coherence by introducing stochastic block representation alignment loss.

eess.IV

[493] Groupwise Registration with Physics-Informed Test-Time Adaptation on Multi-parametric Cardiac MRI

Xinqi Li, Yi Zhang, Li-Ting Huang, Hsiao-Huang Chang, Thoralf Niendorf, Min-Chi Ku, Qian Tao, Hsin-Jung Yang

Main category: eess.IV

TL;DR: A physics-informed deep learning model with test-time adaptation for group image registration of multiparametric MRI maps to address misalignment issues between different tissue contrast images.

DetailsMotivation: Misalignment between multiparametric MRI maps (e.g., T1 and T2 mapping) makes pixel-wise analysis challenging, requiring a solution for accurate registration across different contrast-weighted images.

Method: Developed a generalizable physics-informed deep-learning model using test-time adaptation, utilizing synthetic images from specific physics models as registration references for transductive learning across various tissue contrasts.

Result: Validated in healthy volunteers with various MRI sequences, demonstrating improved multi-modal registration performance across a wide range of image contrast variability.

Conclusion: The proposed physics-informed deep learning approach effectively addresses misalignment challenges in multiparametric mapping MRI, enabling more reliable pixel-wise analysis for myocardial tissue characterization.

Abstract: Multiparametric mapping MRI has become a viable tool for myocardial tissue characterization. However, misalignment between multiparametric maps makes pixel-wise analysis challenging. To address this challenge, we developed a generalizable physics-informed deep-learning model using test-time adaptation to enable group image registration across contrast-weighted images acquired from multiple physical models (e.g., a T1 mapping model and T2 mapping model). The physics-informed adaptation utilizes synthetic images from the specific physics model as the registration reference, allowing transductive learning across various tissue contrasts. We validated the model in healthy volunteers with various MRI sequences, demonstrating improved multi-modal registration across a wide range of image contrast variability.

[494] Functional Connectome Fingerprinting Using Convolutional and Dictionary Learning

Yashaswini, Sanjay Ghosh

Main category: eess.IV

TL;DR: A framework combining convolutional autoencoders and sparse dictionary learning improves fMRI functional connectivity fingerprinting accuracy by 10% over baseline methods.

DetailsMotivation: To enhance individual identification from fMRI data by leveraging neural connectivity variability, addressing limitations of traditional methods with large datasets.

Method: Combines convolutional autoencoders to capture shared connectivity patterns and isolate subject-specific features in residual FC matrices, then uses sparse dictionary learning to identify distinctive features.

Result: Achieved 10% improvement in fingerprint accuracy over baseline group-averaged FC models on the Human Connectome Project dataset.

Conclusion: Integration of deep learning and sparse coding enables scalable and robust functional connectome fingerprinting for personalized neuroscience applications.

Abstract: Advances in data analysis and machine learning have revolutionized the study of brain signatures using fMRI, enabling non-invasive exploration of cognition and behavior through individual neural patterns. Functional connectivity (FC), which quantifies statistical relationships between brain regions, has emerged as a key metric for studying individual variability and developing biomarkers for personalized medicine in neurological and psychiatric disorders. The concept of subject fingerprinting, introduced by Finn et al. (2015), leverages neural connectivity variability to identify individuals based on their unique patterns. While traditional FC methods perform well on small datasets, machine learning techniques are more effective with larger datasets, isolating individual-specific features and maximizing inter-subject differences. In this study, we propose a framework combining convolutional autoencoders and sparse dictionary learning to enhance fingerprint accuracy. Autoencoders capture shared connectivity patterns while isolating subject-specific features in residual FC matrices, which are analyzed using sparse coding to identify distinctive features. Tested on the Human Connectome Project dataset, this approach achieved a 10% improvement over baseline group-averaged FC models. Our results highlight the potential of integrating deep learning and sparse coding techniques for scalable and robust functional connectome fingerprinting, advancing personalized neuroscience applications and biomarker development.
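
A sketch of the second stage under stated assumptions: sparse-code the residual FC matrices (observed minus autoencoder reconstruction) with scikit-learn dictionary learning; fingerprinting then matches codes across sessions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def fingerprint_codes(fc_matrices, reconstructions, n_atoms=64, alpha=1.0):
    """Sparse-code residual FC matrices to isolate subject-specific features.

    fc_matrices, reconstructions: (n_subjects, R, R); the residual removes
    the shared structure captured by the convolutional autoencoder.
    """
    residuals = fc_matrices - reconstructions
    iu = np.triu_indices(residuals.shape[1], k=1)        # FC is symmetric
    X = np.stack([r[iu] for r in residuals])             # (n_subjects, n_edges)
    dico = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=alpha,
                                       random_state=0)
    return dico.fit_transform(X)                          # sparse codes per subject

# Identification: match test-session codes to the most similar
# training-session codes (cosine or Pearson similarity).
```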

[495] BitSemCom: A Bit-Level Semantic Communication Framework with Learnable Probabilistic Mapping

Haoshuo Zhang, Yufei Bo, Jianhua Mo, Meixia Tao

Main category: eess.IV

TL;DR: BitSemCom is a novel bit-level semantic communication framework that enables joint source-channel coding at the bit level using a learnable bit mapper with Gumbel-Softmax trick, achieving competitive performance and superior robustness while maintaining low complexity.

DetailsMotivation: Existing semantic communication systems use analog modulation which is incompatible with modern digital systems, and current digital approaches lack end-to-end bit-level methods that are robust to noise and free from quantization errors.

Method: Proposed BitSemCom framework with modular learnable bit mapper that establishes probabilistic mapping between continuous semantic features and discrete bits using Gumbel-Softmax trick for differentiable bit generation.

Result: BitSemCom achieves competitive performance and superior robustness compared to traditional SSCC schemes, outperforms deep learning based JSCC with uniform 1-bit quantization, while adding only 0.42% parameters and 0.09% computational complexity.

Conclusion: BitSemCom provides a lightweight and practical solution for real-world semantic communication that is compatible with arbitrary modulation formats and robust to channel noise.

Abstract: Most existing semantic communication systems employ analog modulation, which is incompatible with modern digital communication systems. Although several digital transmission approaches have been proposed to address this issue, an end-to-end bit-level method that is compatible with arbitrary modulation formats, robust to channel noise, and free from quantization errors remains lacking. To this end, we propose BitSemCom, a novel bit-level semantic communication framework that realizes true joint source-channel coding (JSCC) at the bit level. Specifically, we introduce a modular learnable bit mapper that establishes a probabilistic mapping between continuous semantic features and discrete bits, utilizing the Gumbel-Softmax trick to enable differentiable bit generation. Simulation results on image transmission demonstrate that BitSemCom achieves both competitive performance and superior robustness compared to traditional separate source-channel coding (SSCC) schemes, and outperforms deep learning based JSCC with uniform 1-bit quantization, validating the effectiveness of the learnable bit mapper. Despite these improvements, the bit mapper adds only 0.42% parameters and 0.09% computational complexity, making BitSemCom a lightweight and practical solution for real-world semantic communication.
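
The core trick is easy to show concretely: per-bit two-class logits passed through hard Gumbel-Softmax yield discrete 0/1 bits in the forward pass while the straight-through estimator keeps the mapping differentiable. A minimal sketch with assumed layer sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableBitMapper(nn.Module):
    """Probabilistic feature-to-bit mapping via the Gumbel-Softmax trick.

    Each output bit gets a 2-way logit; hard sampling yields 0/1 bits in the
    forward pass while gradients flow through the soft relaxation.
    """
    def __init__(self, feat_dim, n_bits, tau=1.0):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_bits * 2)
        self.n_bits, self.tau = n_bits, tau

    def forward(self, z):                                   # z: (B, feat_dim)
        logits = self.proj(z).view(-1, self.n_bits, 2)
        onehot = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        return onehot[..., 1]                               # (B, n_bits) in {0, 1}
```

The emitted bits can then feed any standard digital modulator, which is what makes the scheme compatible with arbitrary modulation formats.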

[496] MORE: Multi-Organ Medical Image REconstruction Dataset

Shaokai Wu, Yapan Guo, Yanbiao Ji, Jing Tong, Yuxiang Lu, Mei Li, Suizhi Huang, Yue Ding, Hongtao Lu

Main category: eess.IV

TL;DR: The paper introduces the MORE dataset for multi-organ CT reconstruction, containing 9 anatomies and 15 lesion types to improve model generalization, and establishes a baseline method that outperforms prior approaches.

DetailsMotivation: Current deep learning methods for CT reconstruction are limited to specific anatomies and datasets, hindering generalization to unseen anatomies and lesions.

Method: Created the MORE dataset with CT scans across 9 diverse anatomies and 15 lesion types, and established a strong baseline solution for CT reconstruction.

Result: The comprehensive dataset improves model generalization capability, and optimization-based methods offer enhanced robustness for unseen anatomies.

Conclusion: The MORE dataset enables robust training and evaluation of CT reconstruction models, demonstrating improved generalization across diverse anatomies and lesions.

Abstract: CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page https://more-med.github.io/

[497] SPG-CDENet: Spatial Prior-Guided Cross Dual Encoder Network for Multi-Organ Segmentation

Xizhi Tian, Changjun Zhou, Yulin Yang

Main category: eess.IV

TL;DR: SPG-CDENet is a two-stage network for multi-organ segmentation that uses spatial priors and cross-attention between global and local encoders to handle organ size/shape variations.

DetailsMotivation: Address challenges in multi-organ segmentation caused by huge variations in organ size and shape that limit effectiveness of existing deep learning methods.

Method: Two-stage approach with spatial prior network for coarse localization and cross dual encoder network with global/local encoders, symmetric cross-attention module, and flow-based decoder for feature propagation.

Result: Superior performance demonstrated on two public datasets compared to existing segmentation methods, with ablation studies validating effectiveness of proposed modules.

Conclusion: SPG-CDENet effectively improves multi-organ segmentation accuracy through spatial guidance and enhanced feature interaction between global and local contexts.

Abstract: Multi-organ segmentation is a critical task in computer-aided diagnosis. While recent deep learning methods have achieved remarkable success in image segmentation, huge variations in organ size and shape challenge their effectiveness in multi-organ segmentation. To address these challenges, we propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel two-stage segmentation paradigm designed to improve multi-organ segmentation accuracy. Our SPG-CDENet consists of two key components: a spatial prior network and a cross dual encoder network. The prior network generates coarse localization maps that delineate the approximate ROI, serving as spatial guidance for the dual encoder network. The cross dual encoder network comprises four essential components: a global encoder, a local encoder, a symmetric cross-attention module, and a flow-based decoder. The global encoder captures global semantic features from the entire image, while the local encoder focuses on features from the prior network. To enhance the interaction between the global and local encoders, a symmetric cross-attention module is proposed across all layers of the encoders to fuse and refine features. Furthermore, the flow-based decoder directly propagates high-level semantic features from the final encoder layer to all decoder layers, maximizing feature preservation and utilization. Extensive qualitative and quantitative experiments on two public datasets demonstrate the superior performance of SPG-CDENet compared to existing segmentation methods. Furthermore, ablation studies further validate the effectiveness of the proposed modules in improving segmentation accuracy.
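
A hedged sketch of a symmetric cross-attention exchange between the two encoder streams: each stream queries the other and is updated residually. The paper applies such a module across all encoder layers; the layout details here are illustrative.

```python
import torch
import torch.nn as nn

class SymmetricCrossAttention(nn.Module):
    """Two-way cross-attention between global and local encoder features."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.g2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2g = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_g, x_l):             # x_g: (B, Ng, C), x_l: (B, Nl, C)
        g_out, _ = self.g2l(query=x_g, key=x_l, value=x_l)  # global attends to local
        l_out, _ = self.l2g(query=x_l, key=x_g, value=x_g)  # local attends to global
        return x_g + g_out, x_l + l_out                      # residual fusion
```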

[498] Comparative Analysis of Deep Learning Models for Olive Tree Crown and Shadow Segmentation Towards Biovolume Estimation

Wondimagegn Abebe Demissie, Stefano Roccella, Rudy Rossetto, Antonio Minnocci, Andrea Vannini, Luca Sebastiani

Main category: eess.IV

TL;DR: Comparative analysis of three deep learning models (U-Net, YOLOv11m-seg, Mask R-CNN) for olive tree crown and shadow segmentation in UAV imagery, with Mask R-CNN achieving best accuracy for biovolume estimation while YOLOv11m-seg offers fastest processing.

DetailsMotivation: Olive tree biovolume estimation is crucial for precision agriculture, yield prediction, and resource management in Mediterranean regions affected by climate-induced stress.

Method: Used three deep learning models (U-Net, YOLOv11m-seg, Mask R-CNN) on UAV imagery with manual crown and shadow annotations. Estimated biovolume by combining crown projected area with shadow-derived height using solar geometry.

Result: Mask R-CNN achieved best accuracy (F1 = 0.86; mIoU = 0.72), YOLOv11m-seg provided fastest throughput (0.12s/image). Estimated biovolumes ranged from 4 to 24 cubic meters, showing structural differences among trees.

Conclusion: Mask R-CNN is preferred for accuracy-critical applications, YOLOv11m-seg for speed-critical large-area deployments, and U-Net as lightweight high-sensitivity option. Framework enables scalable orchard monitoring with potential for DEM/DSM integration and field calibration.

Abstract: Olive tree biovolume estimation is a key task in precision agriculture, supporting yield prediction and resource management, especially in Mediterranean regions severely impacted by climate-induced stress. This study presents a comparative analysis of three deep learning models, U-Net, YOLOv11m-seg, and Mask R-CNN, for segmenting olive tree crowns and their shadows in ultra-high resolution UAV imagery. The UAV dataset, acquired over Vicopisano, Italy, includes manually annotated crown and shadow masks. Building on these annotations, the methodology emphasizes spatial feature extraction and robust segmentation; per-tree biovolume is then estimated by combining crown projected area with shadow-derived height using solar geometry. In testing, Mask R-CNN achieved the best overall accuracy (F1 = 0.86; mIoU = 0.72), while YOLOv11m-seg provided the fastest throughput (0.12 seconds per image). The estimated biovolumes spanned from approximately 4 to 24 cubic meters, reflecting clear structural differences among trees. These results indicate Mask R-CNN is preferable when biovolume accuracy is paramount, whereas YOLOv11m-seg suits large-area deployments where speed is critical; U-Net remains a lightweight, high-sensitivity option. The framework enables accurate, scalable orchard monitoring and can be further strengthened with DEM or DSM integration and field calibration for operational decision support.
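
The shadow-to-height step is plain solar geometry, and the area-times-height step only needs a crown-shape model. A small sketch follows; the ellipsoidal crown assumption is ours, since the paper states only that crown area and shadow-derived height are combined.

```python
import math

def height_from_shadow(shadow_len_m: float, solar_elev_deg: float) -> float:
    """Solar geometry: an object casting a shadow of length L at solar
    elevation angle e has height h = L * tan(e)."""
    return shadow_len_m * math.tan(math.radians(solar_elev_deg))

def biovolume_m3(crown_area_m2: float, height_m: float) -> float:
    """Ellipsoidal-crown assumption: projected area A = pi*a*b and
    semi-height c = h/2 give V = (4/3)*A*c = (2/3)*A*h."""
    return (2.0 / 3.0) * crown_area_m2 * height_m

# example: 3.5 m shadow at 40 deg solar elevation, 12 m^2 crown area
h = height_from_shadow(3.5, 40.0)   # ~2.94 m
v = biovolume_m3(12.0, h)           # ~23.5 m^3, inside the reported 4-24 range
print(f"height = {h:.2f} m, biovolume = {v:.1f} m^3")
```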

[499] SAMRI: Segment Anything Model for MRI

Zhao Wang, Wei Dai, Thuy Thanh Dao, Steffen Bollmann, Hongfu Sun, Craig Engstrom, Shekhar S. Chandra

Main category: eess.IV

TL;DR: SAMRI is an MRI-specialized adaptation of Segment Anything Model (SAM) that achieves state-of-the-art MRI segmentation with 94% reduced training time and 96% fewer trainable parameters through a two-stage fine-tuning strategy.

DetailsMotivation: Manual MRI segmentation is labor-intensive, while CNN-based methods struggle with MRI's variable contrast, intensity inhomogeneity, and protocol variations. Although SAM shows strong generalizability in natural images, existing adaptations overlook MRI-specific challenges.

Method: Fine-tune SAM’s mask decoder using a two-stage strategy on 1.1 million labeled MR slices spanning whole-body organs and pathologies, rather than full-model retraining.

Result: Achieves mean Dice of 0.87 across diverse MRI segmentation tasks, with state-of-the-art accuracy across anatomical regions and robust generalization on unseen structures, especially small and clinically important structures.

Conclusion: SAM can be effectively adapted to MRI through simple fine-tuning of its mask decoder, achieving excellent performance while dramatically reducing training requirements.

Abstract: Accurate magnetic resonance imaging (MRI) segmentation is crucial for clinical decision-making, but remains labor-intensive when performed manually. Convolutional neural network (CNN)-based methods can be accurate and efficient, but often generalize poorly to MRI’s variable contrast, intensity inhomogeneity, and protocols. Although the transformer-based Segment Anything Model (SAM) has demonstrated remarkable generalizability in natural images, existing adaptations often treat MRI as another imaging modality, overlooking these modality-specific challenges. We present SAMRI, an MRI-specialized SAM trained and validated on 1.1 million labeled MR slices spanning whole-body organs and pathologies. We demonstrate that SAM can be effectively adapted to MRI by simply fine-tuning its mask decoder using a two-stage strategy, reducing training time by 94% and trainable parameters by 96% versus full-model retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice of 0.87, delivering state-of-the-art accuracy across anatomical regions and robust generalization on unseen structures, particularly small and clinically important structures.
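
Freezing everything except the mask decoder is straightforward with the public segment-anything API; the snippet below is a hedged sketch of that idea only (the paper's two-stage schedule itself is not reproduced here).

```python
import torch.nn as nn

def freeze_all_but_mask_decoder(sam: nn.Module) -> None:
    """Make only SAM's mask decoder trainable.

    Attribute names (image_encoder / prompt_encoder / mask_decoder)
    follow the public segment-anything repo; adjust if your wrapper
    names them differently.
    """
    for p in sam.parameters():
        p.requires_grad = False
    for p in sam.mask_decoder.parameters():
        p.requires_grad = True

def trainable_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# usage sketch:
# from segment_anything import sam_model_registry
# sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
# freeze_all_but_mask_decoder(sam)
# print(trainable_params(sam))  # a small fraction of the full model
```

Training only the decoder is what makes the reported 94%/96% reductions in training time and trainable parameters plausible, since the ViT image encoder dominates SAM's parameter count.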

[500] BRIQA: Balanced Reweighting in Image Quality Assessment of Pediatric Brain MRI

Alya Almsouti, Ainur Khamitova, Darya Taratynova, Mohammad Yaqub

Main category: eess.IV

TL;DR: BRIQA is a method for automated assessment of artifact severity in pediatric brain MRI that addresses class imbalance through gradient-based loss reweighting and rotating batching, improving classification performance across multiple artifact types.

DetailsMotivation: Manual quality assessment of pediatric brain MRI artifacts is time-consuming and subjective, especially in low-field systems with reduced signal-to-noise ratio, creating need for robust automated solutions.

Method: BRIQA uses gradient-based loss reweighting to dynamically adjust per-class contributions and employs a rotating batching scheme to ensure consistent exposure to underrepresented classes in artifact severity classification.

Result: BRIQA improves average macro F1 score from 0.659 to 0.706, with significant gains across Noise (0.430), Zipper (0.098), Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012) artifact severity classification.

Conclusion: No single architecture performs best across all artifact types, emphasizing the importance of architectural diversity, and rotating batching combined with cross-entropy loss improves performance across metrics by promoting balanced learning.

Abstract: Assessing the severity of artifacts in pediatric brain Magnetic Resonance Imaging (MRI) is critical for diagnostic accuracy, especially in low-field systems where the signal-to-noise ratio is reduced. Manual quality assessment is time-consuming and subjective, motivating the need for robust automated solutions. In this work, we propose BRIQA (Balanced Reweighting in Image Quality Assessment), which addresses class imbalance in artifact severity levels. BRIQA uses gradient-based loss reweighting to dynamically adjust per-class contributions and employs a rotating batching scheme to ensure consistent exposure to underrepresented classes. Experiments show that no single architecture performs best across all artifact types, emphasizing the importance of architectural diversity. The rotating batching configuration improves performance across metrics by promoting balanced learning when combined with cross-entropy loss. BRIQA improves average macro F1 score from 0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098), Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012) artifact severity classification. The code is available at https://github.com/BioMedIA-MBZUAI/BRIQA.
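
As a rough illustration of dynamic per-class reweighting, the sketch below updates class weights from a running per-class loss, a cheap proxy for the paper's gradient-based rule; the exact update and the rotating batching scheme are not specified here, so treat everything below as an assumption-laden outline.

```python
import torch
import torch.nn.functional as F

class DynamicClassReweighter:
    """Per-class loss reweighting updated on the fly.

    Proxy sketch: classes with higher running loss get larger weights.
    BRIQA derives its weights from gradient signals; the running-loss
    proxy and momentum value here are assumptions.
    """

    def __init__(self, num_classes: int, momentum: float = 0.9):
        self.running = torch.ones(num_classes)
        self.momentum = momentum

    def step(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        per_sample = F.cross_entropy(logits, targets, reduction="none")
        for c in targets.unique():
            self.running[c] = (self.momentum * self.running[c]
                               + (1 - self.momentum)
                               * per_sample[targets == c].mean().item())
        weights = self.running / self.running.mean()   # harder classes weigh more
        return F.cross_entropy(logits, targets, weight=weights)

# toy usage: 5 severity levels, batch of 8
logits, targets = torch.randn(8, 5), torch.randint(0, 5, (8,))
loss = DynamicClassReweighter(num_classes=5).step(logits, targets)
```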

[501] ReCon-GS: Continuum-Preserved Gaussian Streaming for Fast and Compact Reconstruction of Dynamic Scenes

Jiaye Fu, Qiankun Gao, Chengxiang Wen, Yanmin Wu, Siwei Ma, Jiaqi Zhang, Jian Zhang

Main category: eess.IV

TL;DR: ReCon-GS is a storage-aware framework for online free-viewpoint video reconstruction that uses multi-level Anchor Gaussians with dynamic hierarchy reconfiguration to achieve efficient training, high-quality rendering, and 50% memory reduction compared to state-of-the-art methods.

DetailsMotivation: To address challenges in online FVV reconstruction including slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands.

Method: Uses dynamically allocated multi-level Anchor Gaussians in density-adaptive fashion, dynamic hierarchy reconfiguration strategy with on-demand anchor re-hierarchization, and storage-aware optimization that adjusts Gaussian density at different hierarchy levels.

Result: Improves training efficiency by ~15%, achieves superior FVV synthesis quality with enhanced robustness and stability, and reduces memory requirements by over 50% at equivalent rendering quality compared to state-of-the-art methods.

Conclusion: ReCon-GS enables high-fidelity online dynamic scene reconstruction and real-time rendering with significant efficiency gains and storage reduction, demonstrating effectiveness across three widely used datasets.

Abstract: Online free-viewpoint video (FVV) reconstruction is challenged by slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands. To address these challenges, we propose the Reconfigurable Continuum Gaussian Stream, dubbed ReCon-GS, a novel storage-aware framework that enables high-fidelity online dynamic scene reconstruction and real-time rendering. Specifically, we dynamically allocate multi-level Anchor Gaussians in a density-adaptive fashion to capture inter-frame geometric deformations, thereby decomposing scene motion into compact coarse-to-fine representations. Then, we design a dynamic hierarchy reconfiguration strategy that preserves localized motion expressiveness through on-demand anchor re-hierarchization, while ensuring temporal consistency through intra-hierarchical deformation inheritance that confines transformation priors to their respective hierarchy levels. Furthermore, we introduce a storage-aware optimization mechanism that flexibly adjusts the density of Anchor Gaussians at different hierarchy levels, enabling a controllable trade-off between reconstruction fidelity and memory usage. Extensive experiments on three widely used datasets demonstrate that, compared to state-of-the-art methods, ReCon-GS improves training efficiency by approximately 15% and achieves superior FVV synthesis quality with enhanced robustness and stability. Moreover, at equivalent rendering quality, ReCon-GS slashes memory requirements by over 50% compared to leading state-of-the-art methods.
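
The storage-aware trade-off can be pictured as a budgeted allocation over hierarchy levels. The greedy rule below, which trims the finest level first, is purely illustrative and is our assumption, not the paper's mechanism.

```python
def allocate_anchor_budget(level_counts, bytes_per_anchor, budget_bytes):
    """Trim anchor Gaussians, finest hierarchy level first, to fit a
    memory budget. Greedy and purely illustrative."""
    counts = list(level_counts)                      # ordered coarse -> fine
    used = sum(counts) * bytes_per_anchor
    level = len(counts) - 1                          # start at the finest level
    while used > budget_bytes and level >= 0:
        # ceil of how many anchors must go to reach the budget
        excess = (used - budget_bytes + bytes_per_anchor - 1) // bytes_per_anchor
        counts[level] -= min(counts[level], excess)
        used = sum(counts) * bytes_per_anchor
        level -= 1
    return counts

# example: three levels, 256 bytes per anchor, 0.5 MB budget
print(allocate_anchor_budget([500, 1500, 4000], 256, 512_000))  # -> [500, 1500, 0]
```

Dropping fine-level anchors first preserves the coarse motion skeleton, which matches the coarse-to-fine decomposition the abstract describes.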

[502] ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection

Paul F. R. Wilson, Mohamed Harmanani, Minh Nguyen Nhat To, Amoon Jamzad, Tarek Elghareb, Zhuoxin Guo, Adam Kinnaird, Brian Wodlinger, Purang Abolmaesumi, Parvin Mousavi

Main category: eess.IV

TL;DR: ProstNFound+ is a medical foundation model adapted for prostate cancer detection from micro-ultrasound, showing strong generalization in prospective validation and potential for clinical deployment.

DetailsMotivation: Medical foundation models offer potential for high-performance diagnostic systems, but their application to prostate cancer detection from micro-ultrasound remains untested in clinical settings.

Method: ProstNFound+ incorporates a medical foundation model, adapter tuning, and custom prompt encoder that embeds prostate cancer-specific clinical biomarkers. It generates cancer heatmaps and risk scores for clinically significant prostate cancer.

Result: The model shows strong generalization to prospective data with no performance degradation compared to retrospective evaluation. It aligns closely with clinical scores (PRI-MUS and PI-RADS) and produces interpretable heatmaps consistent with biopsy-confirmed lesions.

Conclusion: The results highlight its potential for clinical deployment, offering a scalable and interpretable alternative to expert-driven protocols.

Abstract: Purpose: Medical foundation models (FMs) offer a path to build high-performance diagnostic systems. However, their application to prostate cancer (PCa) detection from micro-ultrasound (µUS) remains untested in clinical settings. We present ProstNFound+, an adaptation of FMs for PCa detection from µUS, along with its first prospective validation. Methods: ProstNFound+ incorporates a medical FM, adapter tuning, and a custom prompt encoder that embeds PCa-specific clinical biomarkers. The model generates a cancer heatmap and a risk score for clinically significant PCa. Following training on multi-center retrospective data, the model is prospectively evaluated on data acquired five years later from a new clinical site. Model predictions are benchmarked against standard clinical scoring protocols (PRI-MUS and PI-RADS). Results: ProstNFound+ shows strong generalization to the prospective data, with no performance degradation compared to retrospective evaluation. It aligns closely with clinical scores and produces interpretable heatmaps consistent with biopsy-confirmed lesions. Conclusion: The results highlight its potential for clinical deployment, offering a scalable and interpretable alternative to expert-driven protocols.
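
A prompt encoder that embeds scalar clinical biomarkers might look like the sketch below; the biomarker list (PSA, PSA density, age) and the single-token output are hypothetical stand-ins for whatever ProstNFound+ actually encodes.

```python
import torch
import torch.nn as nn

class BiomarkerPromptEncoder(nn.Module):
    """Embed scalar clinical biomarkers as a prompt token.

    Hypothetical sketch: three normalized inputs (e.g., PSA, PSA density,
    age) are mapped to one prompt token of the decoder's embedding width.
    """

    def __init__(self, num_biomarkers: int = 3, embed_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_biomarkers, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, biomarkers: torch.Tensor) -> torch.Tensor:
        # biomarkers: (batch, num_biomarkers) -> (batch, 1, embed_dim)
        return self.mlp(biomarkers).unsqueeze(1)

# toy usage: two patients, values already normalized to [0, 1]
tokens = BiomarkerPromptEncoder()(torch.tensor([[0.65, 0.15, 0.63],
                                                [0.41, 0.08, 0.55]]))
print(tokens.shape)  # torch.Size([2, 1, 256])
```

The resulting token can be concatenated with the foundation model's other prompt embeddings, letting clinical context steer the mask decoder without retraining the image encoder.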

[503] FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes

Muraam Abdel-Ghani, Mahmoud Ali, Mohamed Ali, Fatmaelzahraa Ahmed, Muhammad Arsalan, Abdulaziz Al-Ali, Shidin Balakrishnan

Main category: eess.IV

TL;DR: FASL-Seg is a novel surgical scene segmentation model that uses dual processing streams (low-level and high-level feature projections) to capture multi-scale features for precise segmentation of both anatomical objects and surgical instruments, achieving state-of-the-art performance on surgical benchmarks.

DetailsMotivation: Current surgical segmentation models focus mainly on surgical tools and overlook anatomical objects, while existing SOTA models struggle to balance high-level contextual features with low-level edge features needed for precise segmentation.

Method: Proposed Feature-Adaptive Spatial Localization model (FASL-Seg) with two distinct processing streams: Low-Level Feature Projection (LLFP) and High-Level Feature Projection (HLFP) streams to capture features at multiple levels of detail for varying feature resolutions.

Result: Achieved mIoU of 72.71% on parts and anatomy segmentation in EndoVis18 (5% improvement over SOTA), mIoU of 85.61% on EndoVis18 tool type segmentation, and 72.78% on EndoVis17 tool type segmentation, outperforming SOTA overall performance with comparable per-class results.

Conclusion: The dual-stream architecture effectively handles varying feature resolutions, demonstrating consistent performance across different classes for both anatomy and surgical instruments, proving the effectiveness of distinct processing streams for surgical scene segmentation.

Abstract: The growing popularity of robotic minimally invasive surgeries has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions - enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on the surgical segmentation benchmark datasets EndoVis18 and EndoVis17 across three use cases. The FASL-Seg model achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves an mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming overall SOTA performance, with comparable per-class results in both datasets and consistent performance across anatomy and instrument classes, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
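
The two-stream idea can be sketched as separate shallow and deep projections fused at the shallow stream's resolution; the channel widths and the concatenation-based fusion below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamProjection(nn.Module):
    """Separate low-level (edge) and high-level (context) projections.

    Sketch of the LLFP/HLFP idea: a 3x3 conv preserves shallow edge
    detail, a 1x1 conv compresses deep context, and the two are fused
    at the shallow resolution. Widths and concat fusion are assumptions.
    """

    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        self.llfp = nn.Sequential(nn.Conv2d(low_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.hlfp = nn.Sequential(nn.Conv2d(high_ch, out_ch, 1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, low_feat, high_feat):
        low = self.llfp(low_feat)
        # upsample deep context to the shallow stream's spatial size
        high = F.interpolate(self.hlfp(high_feat), size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([low, high], dim=1))

# toy usage: shallow 128x128 features, deep 16x16 features
out = DualStreamProjection(64, 512, 128)(torch.randn(1, 64, 128, 128),
                                         torch.randn(1, 512, 16, 16))
print(out.shape)  # torch.Size([1, 128, 128, 128])
```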

Daeyoung Kim

Main category: eess.IV

TL;DR: GCVAMD is a novel causal AMD analysis model that uses modified CausalVAE to extract latent causal factors from OCT images, enabling causal inference for treatment simulation and intervention analysis on AMD risk factors like drusen and neovascularization.

DetailsMotivation: Previous deep learning methods focused only on prediction performance without understanding AMD pathologies or causal mechanisms, which limits intervention analysis and reliability. There's a need for models that can identify underlying causal factors of AMD.

Method: GCVAMD incorporates a modified CausalVAE approach to extract latent causal factors from raw OCT images, enabling causal inference for treatment simulation and intervention analysis on AMD risk factors.

Result: GCVAMD successfully identifies drusen and neovascularization status with AMD causal mechanisms in latent spaces, which can be used for various tasks from AMD classification to intervention analysis.

Conclusion: The proposed GCVAMD model enables causal analysis of AMD by extracting meaningful latent causal factors from OCT images, providing capabilities for intervention analysis and enhanced downstream tasks beyond simple classification.

Abstract: Age-Related Macular Degeneration (AMD) is one of the leading causes of permanent vision impairment in ophthalmology. Though treatments such as anti-VEGF drugs or photodynamic therapies were developed to slow down the degenerative process of AMD, there is still no specific cure to reverse vision loss caused by AMD. Thus, detecting AMD or its risk factors in the patient's retina at an early stage is a crucial task for reducing the possibility of vision impairment. Apart from traditional approaches, deep learning based methods, especially attention mechanism based CNNs and GradCAM based XAI analysis on OCT scans, have exhibited successful performance in distinguishing AMD retinas from normal retinas, making it possible to use AI-driven models to aid medical diagnosis and analysis of AMD by ophthalmologists. However, despite this success, previous works mostly focused on prediction performance itself, not the pathologies or underlying causal mechanisms of AMD, which can preclude intervention analysis on specific factors or even lead to less reliable decisions. Thus, this paper introduces a novel causal AMD analysis model: GCVAMD, which incorporates a modified CausalVAE approach that can extract latent causal factors from only raw OCT images. By considering causality in AMD detection, GCVAMD enables causal inference such as treatment simulation or intervention analysis regarding the major risk factors, drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that through GCVAMD, drusen status and neovascularization status can be identified with AMD causal mechanisms in GCVAMD latent spaces, which can in turn be used for various tasks from AMD detection (classification) to intervention analysis.
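
Intervention analysis in a CausalVAE-style latent space amounts to a do()-operation: overwrite one latent factor and let its descendants react. The linear-SCM toy below illustrates the idea only; the adjacency matrix, the factor ordering, and the "drusen" index are all hypothetical.

```python
from typing import Dict, Optional

import torch

def ancestral_sample(A: torch.Tensor, eps: torch.Tensor,
                     do: Optional[Dict[int, float]] = None) -> torch.Tensor:
    """Linear-SCM latent sampling with optional do()-interventions.

    Each latent is z_i = sum_j A[j, i] * z_j + eps_i, evaluated in index
    order (so A is assumed upper-triangular, i.e., already topologically
    sorted). do(z_i = v) severs incoming edges by overwriting z_i.
    """
    z = torch.zeros_like(eps)
    for i in range(eps.shape[-1]):
        z[..., i] = (z[..., :i] * A[:i, i]).sum(-1) + eps[..., i]
        if do and i in do:
            z[..., i] = do[i]
    return z

# toy: 3 factors; pretend factor 0 is "drusen" and simulate its removal
A = torch.tensor([[0.0, 0.8, 0.0],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 0.0]])
eps = torch.randn(4, 3)
z_obs = ancestral_sample(A, eps)
z_int = ancestral_sample(A, eps, do={0: 0.0})  # do(drusen = 0); descendants react
```

In a full model, the intervened latents would be passed to the decoder to render a counterfactual OCT image, which is what the treatment-simulation use case describes.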

Last updated: 2025-11-05