Daily arXiv Papers - 2025-10-07

AI-enhanced summaries of 0 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Decomposing Attention To Find Context-Sensitive Neurons

Alex Gibson

Main category: cs.CL

TL;DR: The paper analyzes transformer language models, focusing on attention heads with spread-out attention patterns and weak content dependence. It proposes a method to approximate combined outputs of stable heads using linear summaries from calibration text, enabling discovery of neurons responding to high-level contextual properties.

DetailsMotivation: To understand how transformer language models work, particularly focusing on attention heads with stable softmax denominators when token distribution is fixed, and to develop methods to uncover neurons that respond to contextual properties without requiring activation data.

Method: Sample softmax denominators from a calibration text to combine outputs of multiple stable attention heads in GPT2-Small’s first layer. Approximate their combined output using linear summaries of surrounding text, enabling neuron discovery from weights alone.

Result: The method successfully uncovered hundreds of first layer neurons that respond to high-level contextual properties of surrounding text, including neurons that didn’t activate on the calibration text used for calibration.

Conclusion: The proposed calibration-based approach enables effective analysis of transformer language models, revealing how attention mechanisms capture contextual information and allowing discovery of functionally relevant neurons directly from model weights.

Abstract: We study transformer language models, analyzing attention heads whose attention patterns are spread out, and whose attention scores depend weakly on content. We argue that the softmax denominators of these heads are stable when the underlying token distribution is fixed. By sampling softmax denominators from a “calibration text”, we can combine together the outputs of multiple such stable heads in the first layer of GPT2-Small, approximating their combined output by a linear summary of the surrounding text. This approximation enables a procedure where from the weights alone - and a single calibration text - we can uncover hundreds of first layer neurons that respond to high-level contextual properties of the surrounding text, including neurons that didn’t activate on the calibration text.

[2] Graph-S3: Enhancing Agentic textual Graph Retrieval with Synthetic Stepwise Supervision

Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu

Main category: cs.CL

TL;DR: Graph-S³ is a novel LLM-based framework for textual graph question answering that addresses graph retrieval challenges through synthetic stepwise supervision and agentic reasoning.

DetailsMotivation: Real-world data often exists as textual graphs, but current LLM-based graph QA systems struggle with efficient graph retrieval - either using shallow embeddings or requiring expensive interactive training with sparse rewards.

Method: Proposes Graph-S³ with: 1) synthetic stepwise supervision using offline-extracted golden subgraphs instead of final-answer rewards, 2) data synthesis pipeline for reward generation, and 3) two-stage training for interactive graph exploration policy.

Result: Achieves average improvements of 8.1% in accuracy and 9.7% in F1 score across three datasets compared to seven strong baselines, with even better performance on complex multi-hop reasoning tasks.

Conclusion: The approach effectively addresses graph retrieval challenges in textual graph QA through stepwise supervision and synthetic training, demonstrating significant performance gains especially in complex reasoning scenarios.

Abstract: A significant portion of real-world data is inherently represented as textual graphs, and integrating these graphs into large language models (LLMs) is promising to enable complex graph-based question answering. However, a key challenge in LLM-based textual graph QA systems lies in graph retrieval, i.e., how to retrieve relevant content from large graphs that is sufficiently informative while remaining compact for the LLM context. Existing retrievers suffer from poor performance since they either rely on shallow embedding similarity or employ interactive retrieving policies that demand excessive data labeling and training cost. To address these issues, we present Graph-$S^3$, an agentic textual graph reasoning framework that employs an LLM-based retriever trained with synthetic stepwise supervision. Instead of rewarding the agent based on the final answers, which may lead to sparse and unstable training signals, we propose to closely evaluate each step of the retriever based on offline-extracted golden subgraphs. Our main techniques include a data synthesis pipeline to extract the golden subgraphs for reward generation and a two-stage training scheme to learn the interactive graph exploration policy based on the synthesized rewards. Based on extensive experiments on three common datasets in comparison with seven strong baselines, our approach achieves an average improvement of 8.1% in accuracy and 9.7% in F$_1$ score. The advantage is even higher in more complicated multi-hop reasoning tasks. Our code will be open-sourced.

[3] Implicit Values Embedded in How Humans and LLMs Complete Subjective Everyday Tasks

Arjun Arunasalam, Madison Pickering, Z. Berkay Celik, Blase Ur

Main category: cs.CL

TL;DR: Audit of six popular LLMs shows they often misalign with human values and each other when completing everyday tasks.

DetailsMotivation: To understand the implicit values (like environmentalism, charity, diversity) that AI assistants exhibit when completing subjective everyday tasks, and compare these values with human preferences.

Method: Audited six popular LLMs completing 30 everyday tasks and compared their responses to 100 human crowdworkers from the US.

Result: LLMs often do not align with human values, nor with other LLMs, in the implicit values they exhibit during task completion.

Conclusion: There is significant misalignment between LLM values and human values in everyday task contexts, highlighting a need for better value alignment in AI assistants.

Abstract: Large language models (LLMs) can underpin AI assistants that help users with everyday tasks, such as by making recommendations or performing basic computation. Despite AI assistants’ promise, little is known about the implicit values these assistants display while completing subjective everyday tasks. Humans may consider values like environmentalism, charity, and diversity. To what extent do LLMs exhibit these values in completing everyday tasks? How do they compare with humans? We answer these questions by auditing how six popular LLMs complete 30 everyday tasks, comparing LLMs to each other and to 100 human crowdworkers from the US. We find LLMs often do not align with humans, nor with other LLMs, in the implicit values exhibited.

[4] Morpheme Induction for Emergent Language

Brendon Boldt, David Mortensen

Main category: cs.CL

TL;DR: CSAR is a greedy algorithm for morpheme induction from emergent language corpora that iteratively selects high mutual information form-meaning pairs and removes them from the corpus.

DetailsMotivation: To develop an effective method for inducing morphemes from emergent language corpora containing parallel utterances and meanings.

Method: A greedy algorithm that: (1) weights morphemes by mutual information between forms and meanings, (2) selects highest-weighted pair, (3) removes it from corpus, (4) repeats process (Count, Select, Ablate, Repeat).

Result: Validated on procedurally generated datasets and human language data, showing reasonable predictions. Also used to analyze emergent languages, quantifying characteristics like synonymy and polysemy.

Conclusion: CSAR is an effective algorithm for morpheme induction that performs well on both synthetic and human language data, and can be used to analyze linguistic characteristics of emergent languages.

Abstract: We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR’s performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.

[5] Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge, Benedikt Schifferer

Main category: cs.CL

TL;DR: Omni-Embed-Nemotron is a unified multimodal retrieval model that extends beyond text and images to support audio and video modalities, enabling both cross-modal and joint-modal retrieval with a single model.

DetailsMotivation: Existing text-based retrievers struggle with visually and semantically rich content in real-world documents like PDFs, slides, and videos, while recent work shows that preserving document layout using image-based representations improves retrieval quality.

Method: Extends retrieval beyond text and images to support audio and video modalities, building on capabilities of recent multimodal models like Qwen2.5-Omni, enabling cross-modal and joint-modal retrieval using a single unified architecture.

Result: Demonstrates effectiveness in text, image, and video retrieval, showing improved performance for handling complex real-world information needs across multiple modalities.

Conclusion: Omni-Embed-Nemotron provides a unified solution for multimodal retrieval that can handle the increasing complexity of real-world information needs across text, image, audio, and video modalities.

Abstract: We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text - video) and joint-modal (e.g., text - video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.

[6] Searching for the Most Human-like Emergent Language

Brendon Boldt, David Mortensen

Main category: cs.CL

TL;DR: The paper develops a signalling game-based environment with hyperparameter optimization using XferBench to generate emergent languages similar to human language, and analyzes entropy effects and optimal hyperparameters.

DetailsMotivation: To generate state-of-the-art emergent languages that closely resemble human language through systematic optimization.

Method: Uses a signalling game-based emergent communication environment with hyperparameter optimization, employing XferBench as objective function to measure statistical similarity to human language via transfer learning suitability.

Result: Demonstrates entropy’s predictive power on transfer learning performance, confirms entropy-minimization properties, and identifies hyperparameters that produce more realistic emergent languages with better transfer to human language.

Conclusion: The approach successfully generates human-like emergent languages and provides insights into entropy relationships and optimal hyperparameter configurations for language realism.

Abstract: In this paper, we design a signalling game-based emergent communication environment to generate state-of-the-art emergent languages in terms of similarity to human language. This is done with hyperparameter optimization, using XferBench as the objective function. XferBench quantifies the statistical similarity of emergent language to human language by measuring its suitability for deep transfer learning to human language. Additionally, we demonstrate the predictive power of entropy on the transfer learning performance of emergent language as well as corroborate previous results on the entropy-minimization properties of emergent communication systems. Finally, we report generalizations regarding what hyperparameters produce more realistic emergent languages, that is, ones which transfer better to human language.

[7] SEER: The Span-based Emotion Evidence Retrieval Benchmark

Aneesha Sampath, Oya Aran, Emily Mower Provost

Main category: cs.CL

TL;DR: The SEER Benchmark tests LLMs’ ability to identify specific text spans that express emotion, focusing on emotion evidence detection rather than just emotion classification.

DetailsMotivation: Traditional emotion recognition assigns single labels to sentences, but applications like empathetic dialogue and clinical support need to know exactly how emotion is expressed through specific text spans.

Method: Created SEER Benchmark with two tasks: identifying emotion evidence within single sentences and across five-sentence passages. Contains 1200 real-world sentences with new annotations for emotion and emotion evidence. Evaluated 14 open-source LLMs.

Result: Some models approach average human performance on single-sentence inputs, but accuracy degrades in longer passages. Error analysis reveals key failure modes including overreliance on emotion keywords and false positives in neutral text.

Conclusion: Current LLMs struggle with span-level emotion evidence detection, especially in longer contexts, highlighting the need for improved models that can accurately pinpoint emotional expressions in text.

Abstract: We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models’ (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.

[8] ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection

Ali Khairallah, Arkaitz Zubiaga

Main category: cs.CL

TL;DR: ALHD is the first large-scale Arabic dataset for distinguishing human vs LLM-generated texts across news, social media, and reviews in both MSA and dialectal Arabic, with over 400K balanced samples.

DetailsMotivation: To address the need for comprehensive Arabic LLM detection capabilities to mitigate risks of misinformation, academic dishonesty, and cyber threats in Arabic content.

Method: Created a balanced dataset with 400K+ samples across three genres using three leading LLMs and multiple human sources, with rigorous preprocessing, rich annotations, and standardized splits. Benchmarking used traditional classifiers, BERT-based models, and LLMs (zero-shot/few-shot).

Result: Fine-tuned BERT models achieved competitive performance, outperforming LLM-based models. However, models struggled with cross-genre generalization, particularly with news articles where LLM-generated texts closely resemble human writing style.

Conclusion: ALHD establishes a foundation for Arabic LLM detection research. Challenges in cross-genre generalization, especially with news content, highlight the need for future research to improve detection robustness across different text types.

Abstract: We introduce ALHD, the first large-scale comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covering both MSA and dialectal Arabic, and contains over 400K balanced samples generated by three leading LLMs and originated from multiple human sources, which enables studying generalizability in Arabic LLM-genearted text detection. We provide rigorous preprocessing, rich annotations, and standardized balanced splits to support reproducibility. In addition, we present, analyze and discuss benchmark experiments using our new dataset, in turn identifying gaps and proposing future research directions. Benchmarking across traditional classifiers, BERT-based models, and LLMs (zero-shot and few-shot) demonstrates that fine-tuned BERT models achieve competitive performance, outperforming LLM-based models. Results are however not always consistent, as we observe challenges when generalizing across genres; indeed, models struggle to generalize when they need to deal with unseen patterns in cross-genre settings, and these challenges are particularly prominent when dealing with news articles, where LLM-generated texts resemble human texts in style, which opens up avenues for future research. ALHD establishes a foundation for research related to Arabic LLM-detection and mitigating risks of misinformation, academic dishonesty, and cyber threats.

[9] TS-Reasoner: Aligning Time Series Foundation Models with LLM Reasoning

Fangxu Yu, Hongyu Zhao, Tianyi Zhou

Main category: cs.CL

TL;DR: TS-Reasoner aligns time series foundation models with large language models for improved time series reasoning, achieving superior performance with high data efficiency.

DetailsMotivation: Existing time series foundation models capture patterns but lack reasoning capabilities, while LLMs have reasoning but struggle with numerical time series data. Integrating both is challenging due to modality alignment issues.

Method: Proposes TS-Reasoner with two-stage training: alignment pretraining using synthetic time series-text pairs, followed by instruction finetuning. Uses frozen pretrained TSFM and aligns its latent representations with LLM textual inputs.

Result: Outperforms various LLMs, VLMs, and Time Series LLMs across multiple benchmarks while using less than half the training data, demonstrating remarkable data efficiency.

Conclusion: The proposed TS-Reasoner effectively bridges the gap between numerical time series understanding and textual reasoning, providing a powerful framework for time series reasoning tasks.

Abstract: Time series reasoning is crucial to decision-making in diverse domains, including finance, energy usage, traffic, weather, and scientific discovery. While existing time series foundation models (TSFMs) can capture low-level dynamic patterns and provide accurate forecasting, further analysis usually requires additional background knowledge and sophisticated reasoning, which are lacking in most TSFMs but can be achieved through large language models (LLMs). On the other hand, without expensive post-training, LLMs often struggle with the numerical understanding of time series data. Although it is intuitive to integrate the two types of models, developing effective training recipes that align the two modalities for reasoning tasks is still an open challenge. To this end, we propose TS-Reasoner that aligns the latent representations of TSFMs with the textual inputs of LLMs for downstream understanding/reasoning tasks. Specifically, we propose a simple yet effective method to curate diverse, synthetic pairs of time series and textual captions for alignment training. We then develop a two-stage training recipe that applies instruction finetuning after the alignment pretraining. Unlike existing works that train an LLM to take time series as inputs, we leverage a pretrained TSFM and freeze it during training. Extensive experiments on several benchmarks demonstrate that TS-Reasoner not only outperforms a wide range of prevailing LLMs, Vision Language Models (VLMs), and Time Series LLMs, but also achieves this with remarkable data efficiency, e.g., using less than half the training data.

[10] Identifying Financial Risk Information Using RAG with a Contrastive Insight

Ali Elahi

Main category: cs.CL

TL;DR: The paper proposes a peer-aware comparative inference layer on top of RAG to improve specialized reasoning by retrieving and comparing similar cases, addressing RAG’s limitation of providing generic outputs in specialized domains like finance.

DetailsMotivation: RAG systems in specialized domains often produce generic outputs that lack context-specific insights, such as identifying common risks for most companies in finance rather than nuanced, case-specific analysis.

Method: A peer-aware comparative inference layer is added on top of RAG to retrieve comparable cases and related problems, enabling contrastive reasoning and nuanced analysis in specialized contexts.

Result: The contrastive approach outperforms baseline RAG in text generation metrics including ROUGE and BERTScore when compared to human-generated equity research and risk analysis.

Conclusion: Adding comparative reasoning capabilities to RAG systems enhances their performance in specialized domains by enabling more nuanced, context-specific insights rather than generic factual outputs.

Abstract: In specialized domains, humans often compare new problems against similar examples, highlight nuances, and draw conclusions instead of analyzing information in isolation. When applying reasoning in specialized contexts with LLMs on top of a RAG, the pipeline can capture contextually relevant information, but it is not designed to retrieve comparable cases or related problems. While RAG is effective at extracting factual information, its outputs in specialized reasoning tasks often remain generic, reflecting broad facts rather than context-specific insights. In finance, it results in generic risks that are true for the majority of companies. To address this limitation, we propose a peer-aware comparative inference layer on top of RAG. Our contrastive approach outperforms baseline RAG in text generation metrics such as ROUGE and BERTScore in comparison with human-generated equity research and risk.

[11] Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs

Sayan Ghosh, Shahzaib Saqib Warraich, Dhruv Tarsadiya, Gregory Yauney, Swabha Swayamdipta

Main category: cs.CL

TL;DR: Consensus Graphs (ConGrs) are DAG-based structures that capture shared information and semantic variation across multiple LM responses, enabling more effective response synthesis through targeted decoding methods.

DetailsMotivation: Existing methods cannot efficiently synthesize rich epistemic signals across different long-form responses from language models, limiting the ability to leverage response variation for improved output quality.

Method: Construct ConGrs using lightweight lexical sequence alignment from bioinformatics supplemented by targeted usage of a secondary LM judge, then design task-dependent decoding methods to synthesize final responses from the ConGr structure.

Result: Improves factual precision on biography generation by up to 31%, reduces reliance on LM judges by over 80%, increases abstention rates on refusal tasks by up to 56%, and improves reasoning task accuracy by up to 6 points over baselines.

Conclusion: ConGrs provide a flexible method for capturing variation in LM responses and using epistemic signals from response variation to synthesize more effective responses across multiple task types.

Abstract: Language models can be sampled multiple times to access the distribution underlying their responses, but existing methods cannot efficiently synthesize rich epistemic signals across different long-form responses. We introduce Consensus Graphs (ConGrs), a flexible DAG-based data structure that represents shared information, as well as semantic variation in a set of sampled LM responses to the same prompt. We construct ConGrs using a light-weight lexical sequence alignment algorithm from bioinformatics, supplemented by the targeted usage of a secondary LM judge. Further, we design task-dependent decoding methods to synthesize a single, final response from our ConGr data structure. Our experiments show that synthesizing responses from ConGrs improves factual precision on two biography generation tasks by up to 31% over an average response and reduces reliance on LM judges by more than 80% compared to other methods. We also use ConGrs for three refusal-based tasks requiring abstention on unanswerable queries and find that abstention rate is increased by up to 56%. We apply our approach to the MATH and AIME reasoning tasks and find an improvement over self-verification and majority vote baselines by up to 6 points of accuracy. We show that ConGrs provide a flexible method for capturing variation in LM responses and using the epistemic signals provided by response variation to synthesize more effective responses.

[12] Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance

Ahmed Alajrami, Xingwei Tan, Nikolaos Aletras

Main category: cs.CL

TL;DR: Instruction-tuning with perturbed instructions (e.g., removing stop words, shuffling words) can improve LLMs’ resistance to noisy instructions and sometimes enhance downstream performance.

DetailsMotivation: LLMs are sensitive to minor variations in instruction phrasing, which affects their usability. This paper explores whether introducing perturbations during instruction-tuning can make models more resilient to noisy user inputs.

Method: Instruction-tuning with perturbations like removing stop words or shuffling words, then evaluating performance on original and perturbed versions of benchmarks (MMLU, BBH, GSM8K). Also assesses learning dynamics and behavior shifts.

Result: Surprisingly, instruction-tuning on perturbed instructions can improve downstream performance in some cases, making LLMs more resilient to noisy instructions.

Conclusion: Including perturbed instructions in instruction-tuning is important for making LLMs more resilient to noisy user inputs, as it can enhance resistance to instruction variations and potentially improve performance.

Abstract: Instruction-tuning plays a vital role in enhancing the task-solving abilities of large language models (LLMs), improving their usability in generating helpful responses on various tasks. However, previous work has demonstrated that they are sensitive to minor variations in instruction phrasing. In this paper, we explore whether introducing perturbations in instruction-tuning data can enhance LLMs’ resistance against noisy instructions. We focus on how instruction-tuning with perturbations, such as removing stop words or shuffling words, affects LLMs’ performance on the original and perturbed versions of widely-used benchmarks (MMLU, BBH, GSM8K). We further assess learning dynamics and potential shifts in model behavior. Surprisingly, our results suggest that instruction-tuning on perturbed instructions can, in some cases, improve downstream performance. These findings highlight the importance of including perturbed instructions in instruction-tuning, which can make LLMs more resilient to noisy user inputs.

[13] TriMediQ: A Triplet-Structured Approach for Interactive Medical Question Answering

Zhaohan Meng, Zaiqiao Meng, Siwei Liu, Iadh Ounis

Main category: cs.CL

TL;DR: TriMediQ improves LLM performance in clinical dialogues by converting patient responses into triplet-based knowledge graphs for multi-hop reasoning, achieving 10.4% accuracy improvement.

DetailsMotivation: LLMs perform well in static medical QA but struggle with interactive clinical dialogues where clinical facts appear in unstructured sentences without clear links.

Method: TriMediQ uses a frozen triplet generator to extract clinical triplets from patient responses, builds a knowledge graph, and employs a trainable projection module with graph encoder and projector for multi-hop reasoning.

Result: TriMediQ achieves up to 10.4% improvement in accuracy over five baselines on the iMedQA dataset.

Conclusion: Converting patient responses into structured triplet-based graphs enables more accurate clinical reasoning in multi-turn settings, providing a solution for LLM-based medical assistants.

Abstract: Large Language Models (LLMs) perform strongly in static and single-turn medical Question Answer (QA) benchmarks, yet such settings diverge from the iterative information gathering process required in practical clinical consultations. The MEDIQ framework addresses this mismatch by recasting the diagnosis as an interactive dialogue between a patient and an expert system, but the reliability of LLMs drops dramatically when forced to reason with dialogue logs, where clinical facts appear in sentences without clear links. To bridge this gap, we introduce TriMediQ, a triplet-structured approach that summarises patient responses into triplets and integrates them into a Knowledge Graph (KG), enabling multi-hop reasoning. We introduce a frozen triplet generator that extracts clinically relevant triplets, using prompts designed to ensure factual consistency. In parallel, a trainable projection module, comprising a graph encoder and a projector, captures relational information from the KG to enhance expert reasoning. TriMediQ operates in two steps: (i) the projection module fine-tuning with all LLM weights frozen; and (ii) using the fine-tuned module to guide multi-hop reasoning during inference. We evaluate TriMediQ on two interactive QA benchmarks, showing that it achieves up to 10.4% improvement in accuracy over five baselines on the iMedQA dataset. These results demonstrate that converting patient responses into structured triplet-based graphs enables more accurate clinical reasoning in multi-turn settings, providing a solution for the deployment of LLM-based medical assistants.

[14] Cross-Lingual Multi-Granularity Framework for Interpretable Parkinson’s Disease Diagnosis from Speech

Ilias Tougui, Mehdi Zakroum, Mounir Ghogho

Main category: cs.CL

TL;DR: A granularity-aware approach for multilingual Parkinson’s Disease detection using phoneme, syllable, and word-level analysis achieves superior performance with phoneme-level analysis (93.78% AUROC).

DetailsMotivation: Current PD detection systems analyze entire utterances, potentially missing diagnostic value in specific phonetic elements. Speech impairments affect up to 89% of PD patients.

Method: Automated pipeline extracts time-aligned phonemes, syllables, and words from recordings. Uses bidirectional LSTM with multi-head attention across Italian, Spanish, and English datasets.

Result: Phoneme-level analysis achieved best performance: 93.78% ± 2.34% AUROC and 92.17% ± 2.43% accuracy. Attention analysis revealed most informative features align with clinical protocols.

Conclusion: Granular phoneme-level analysis provides enhanced diagnostic capability for cross-linguistic PD detection, with features matching established clinical assessment protocols.

Abstract: Parkinson’s Disease (PD) affects over 10 million people worldwide, with speech impairments in up to 89% of patients. Current speech-based detection systems analyze entire utterances, potentially overlooking the diagnostic value of specific phonetic elements. We developed a granularity-aware approach for multilingual PD detection using an automated pipeline that extracts time-aligned phonemes, syllables, and words from recordings. Using Italian, Spanish, and English datasets, we implemented a bidirectional LSTM with multi-head attention to compare diagnostic performance across the different granularity levels. Phoneme-level analysis achieved superior performance with AUROC of 93.78% +- 2.34% and accuracy of 92.17% +- 2.43%. This demonstrates enhanced diagnostic capability for cross-linguistic PD detection. Importantly, attention analysis revealed that the most informative speech features align with those used in established clinical protocols: sustained vowels (/a/, /e/, /o/, /i/) at phoneme level, diadochokinetic syllables (/ta/, /pa/, /la/, /ka/) at syllable level, and /pataka/ sequences at word level. Source code will be available at https://github.com/jetliqs/clearpd.

[15] What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification

Andrew Halterman, Katherine A. Keith

Main category: cs.CL

TL;DR: LLMs in computational social science can lead to conceptualization errors that bias downstream estimates, which cannot be fixed by improving LLM accuracy alone.

DetailsMotivation: To highlight that conceptualization steps before LLM prompting and using predictions in downstream inference are overlooked in LLM-era computational social science, leading to biased estimates.

Method: Using simulations to demonstrate conceptualization-induced bias and showing it cannot be corrected by increasing LLM accuracy or post-hoc bias correction methods.

Result: Conceptualization-induced bias persists despite improvements in LLM accuracy and cannot be eliminated through standard bias correction techniques.

Conclusion: Conceptualization remains a critical first-order concern in LLM-era CSS, and analysts should follow concrete advice for obtaining low-cost, unbiased, low-variance downstream estimates.

Abstract: Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, focus on the steps before and after LLM prompting – conceptualization of concepts to be classified and using LLM predictions in downstream statistical inference – which we argue have been overlooked in much of LLM-era CSS. We claim LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected for solely by increasing LLM accuracy or post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM-era and provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.

[16] CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making

Hasibur Rahman, Hanan Salam

Main category: cs.CL

TL;DR: CCD-Bench is a new benchmark that evaluates how LLMs handle cross-cultural value conflicts, revealing models disproportionately prefer certain cultural clusters (Nordic/Germanic Europe) while underrepresented others, with superficial pluralism in reasoning.

DetailsMotivation: Existing benchmarks focus on cultural knowledge, value prediction, or single-axis bias detection, but none evaluate how LLMs adjudicate when multiple culturally grounded values directly clash.

Method: Created CCD-Bench with 2,182 open-ended dilemmas across 7 domains, paired with 10 anonymized response options corresponding to GLOBE cultural clusters, using stratified Latin square design to mitigate ordering effects. Evaluated 17 non-reasoning LLMs.

Result: Models show strong preference for Nordic Europe (20.2%) and Germanic Europe (12.4%), while Eastern Europe and Middle East/North Africa are underrepresented (5.6-5.8%). Rationales show superficial pluralism - 87.9% reference multiple GLOBE dimensions but mainly recombine Future/Performance Orientation, rarely using Assertiveness or Gender Egalitarianism (<3%).

Conclusion: Current alignment pipelines promote consensus-oriented worldviews that underserve scenarios requiring power negotiation, rights-based reasoning, or gender-aware analysis. CCD-Bench shifts evaluation toward pluralistic decision making and highlights need for alignment strategies that substantively engage diverse worldviews.

Abstract: Although large language models (LLMs) are increasingly implicated in interpersonal and societal decision-making, their ability to navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. Existing benchmarks predominantly target cultural knowledge (CulturalBench), value prediction (WorldValuesBench), or single-axis bias diagnostics (CDEval); none evaluate how LLMs adjudicate when multiple culturally grounded values directly clash. We address this gap with CCD-Bench, a benchmark that assesses LLM decision-making under cross-cultural value conflict. CCD-Bench comprises 2,182 open-ended dilemmas spanning seven domains, each paired with ten anonymized response options corresponding to the ten GLOBE cultural clusters. These dilemmas are presented using a stratified Latin square to mitigate ordering effects. We evaluate 17 non-reasoning LLMs. Models disproportionately prefer Nordic Europe (mean 20.2 percent) and Germanic Europe (12.4 percent), while options for Eastern Europe and the Middle East and North Africa are underrepresented (5.6 to 5.8 percent). Although 87.9 percent of rationales reference multiple GLOBE dimensions, this pluralism is superficial: models recombine Future Orientation and Performance Orientation, and rarely ground choices in Assertiveness or Gender Egalitarianism (both under 3 percent). Ordering effects are negligible (Cramer’s V less than 0.10), and symmetrized KL divergence shows clustering by developer lineage rather than geography. These patterns suggest that current alignment pipelines promote a consensus-oriented worldview that underserves scenarios demanding power negotiation, rights-based reasoning, or gender-aware analysis. CCD-Bench shifts evaluation beyond isolated bias detection toward pluralistic decision making and highlights the need for alignment strategies that substantively engage diverse worldviews.

[17] Robustness assessment of large audio language models in multiple-choice evaluation

Fernando López, Santosh Kesiraju, Jordi Luque

Main category: cs.CL

TL;DR: The paper reveals that large audio language models (LALMs) are highly sensitive to subtle changes in multiple-choice question answering (MCQA) evaluations, such as choice ordering and paraphrasing, and proposes a new evaluation protocol to address this variability.

DetailsMotivation: Current MCQA evaluation frameworks for LALMs report single accuracy numbers without accounting for sensitivity to subtle variations like choice ordering and paraphrasing, leading to potentially misleading results.

Method: Conducted systematic study across three benchmarks (MMAU, MMAR, MMSU) and four models (Audio Flamingo 2, Audio Flamingo 3, Qwen2.5-Omni-7B-Instruct, Kimi-Audio-7B-Instruct), testing sensitivity to choice ordering, question paraphrasing, and choice paraphrasing.

Result: Found that models are highly sensitive to choice ordering, question paraphrasing, and choice paraphrasing, with substantial performance variations across different configurations of the same questions.

Conclusion: Proposed a simpler evaluation protocol and metric that accounts for subtle variations to provide more detailed and reliable evaluation of LALMs within the MCQA framework.

Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in substantially different results. Existing MCQA frameworks do not account for this variability and report a single accuracy number per benchmark or category. We dive into the MCQA evaluation framework and conduct a systematic study spanning three benchmarks (MMAU, MMAR and MMSU) and four models: Audio Flamingo 2, Audio Flamingo 3, Qwen2.5-Omni-7B-Instruct, and Kimi-Audio-7B-Instruct. Our findings indicate that models are sensitive not only to the ordering of choices, but also to the paraphrasing of the question and the choices. Finally, we propose a simpler evaluation protocol and metric that account for subtle variations and provide a more detailed evaluation report of LALMs within the MCQA framework.

[18] Reactive Transformer (RxT) – Stateful Real-Time Processing for Event-Driven Reactive Language Models

Adam Filipek

Main category: cs.CL

TL;DR: The Reactive Transformer (RxT) is a novel architecture that overcomes the stateless nature and quadratic complexity of standard Transformers in conversational AI by using an event-driven paradigm with integrated short-term memory.

DetailsMotivation: Standard Transformers have limitations in conversational AI due to their stateless nature and quadratic computational complexity, leading to prohibitive costs and latency in long dialogues.

Method: RxT processes conversational turns as discrete events, maintaining context in a fixed-size Short-Term Memory system with a generator-decoder for responses and a memory-encoder with Memory Attention for asynchronous memory updates.

Result: RxT reduces computational complexity from quadratic to linear with respect to interactions, achieves low latency, and demonstrates superior performance with constant-time inference latency in proof-of-concept experiments.

Conclusion: RxT enables real-time, stateful, and economically viable long-form conversations by decoupling response generation from memory updates and fundamentally altering the scaling dynamics.

Abstract: The Transformer architecture has become the de facto standard for Large Language Models (LLMs), demonstrating remarkable capabilities in language understanding and generation. However, its application in conversational AI is fundamentally constrained by its stateless nature and the quadratic computational complexity ($O(L^2)$) with respect to sequence length $L$. Current models emulate memory by reprocessing an ever-expanding conversation history with each turn, leading to prohibitive costs and latency in long dialogues. This paper introduces the Reactive Transformer (RxT), a novel architecture designed to overcome these limitations by shifting from a data-driven to an event-driven paradigm. RxT processes each conversational turn as a discrete event in real-time, maintaining context in an integrated, fixed-size Short-Term Memory (STM) system. The architecture features a distinct operational cycle where a generator-decoder produces a response based on the current query and the previous memory state, after which a memory-encoder and a dedicated Memory Attention network asynchronously update the STM with a representation of the complete interaction. This design fundamentally alters the scaling dynamics, reducing the total user-facing cost of a conversation from quadratic ($O(N^2 \cdot T)$) to linear ($O(N \cdot T)$) with respect to the number of interactions $N$. By decoupling response generation from memory updates, RxT achieves low latency, enabling truly real-time, stateful, and economically viable long-form conversations. We validated our architecture with a series of proof-of-concept experiments on synthetic data, demonstrating superior performance and constant-time inference latency compared to a baseline stateless model of comparable size.

[19] LLM, Reporting In! Medical Information Extraction Across Prompting, Fine-tuning and Post-correction

Ikram Belmadani, Parisa Nazari Hashemi, Thomas Sebbag, Benoit Favre, Guillaume Fortier, Solen Quiniou, Emmanuel Morin, Richard Dufour

Main category: cs.CL

TL;DR: The paper presents three approaches for biomedical NER and event extraction in French using LLMs, with GPT-4.1 achieving best results through in-context learning with guideline summaries.

DetailsMotivation: To address biomedical NER and health event extraction in French under few-shot settings, leveraging LLMs to overcome data scarcity.

Method: Three approaches: (1) GPT-4.1 with in-context learning using guideline summaries and example selection, (2) GLiNER fine-tuned on synthetic data with LLM verification, (3) LLaMA-3.1-8B fine-tuned on synthetic data. Event extraction uses GPT-4.1 with similar ICL strategy.

Result: GPT-4.1 achieved macro-F1 of 61.53% for NER and 15.02% for event extraction, outperforming other methods.

Conclusion: Well-crafted prompting with guideline summaries is crucial for maximizing LLM performance in very low-resource biomedical text processing scenarios.

Abstract: This work presents our participation in the EvalLLM 2025 challenge on biomedical Named Entity Recognition (NER) and health event extraction in French (few-shot setting). For NER, we propose three approaches combining large language models (LLMs), annotation guidelines, synthetic data, and post-processing: (1) in-context learning (ICL) with GPT-4.1, incorporating automatic selection of 10 examples and a summary of the annotation guidelines into the prompt, (2) the universal NER system GLiNER, fine-tuned on a synthetic corpus and then verified by an LLM in post-processing, and (3) the open LLM LLaMA-3.1-8B-Instruct, fine-tuned on the same synthetic corpus. Event extraction uses the same ICL strategy with GPT-4.1, reusing the guideline summary in the prompt. Results show GPT-4.1 leads with a macro-F1 of 61.53% for NER and 15.02% for event extraction, highlighting the importance of well-crafted prompting to maximize performance in very low-resource scenarios.

[20] Decoupling Task-Solving and Output Formatting in LLM Generation

Haikang Deng, Po-Nien Kung, Nanyun Peng

Main category: cs.CL

TL;DR: Deco-G is a decoding framework that separates format adherence from task solving in LLMs, using a tractable probabilistic model for format compliance while prompting LLMs only with task instructions, achieving 1.0-6.0% performance gains with guaranteed format compliance.

DetailsMotivation: LLMs struggle with complex prompts that mix task-solving instructions with rigid formatting requirements, creating competing goals that degrade performance. The entanglement of what to solve and how to present the solution suggests explicit separation could improve results.

Method: Deco-G decouples format adherence from task solving using a separate tractable probabilistic model (TPM) for format compliance. It combines next token probabilities from the LLM with TPM-calculated format compliance likelihood at each decoding step. Key innovations include instruction-aware distillation, flexible trie-building algorithm, and HMM state pruning for efficiency.

Result: The approach demonstrates effectiveness across mathematical reasoning, LLM-as-a-judge, and event argument extraction tasks, achieving 1.0% to 6.0% relative performance gains over regular prompting while guaranteeing format compliance.

Conclusion: Explicitly decoupling format adherence from task solving through the Deco-G framework significantly improves LLM performance on complex tasks with rigid formatting requirements, providing a practical and scalable solution for modern instruction-tuned models.

Abstract: Large language models (LLMs) are increasingly adept at following instructions containing task descriptions to solve complex problems, such as mathematical reasoning and automatic evaluation (LLM-as-a-Judge). However, as prompts grow more complex, models often struggle to adhere to all instructions. This difficulty is especially common when instructive prompts intertwine reasoning directives – specifying what the model should solve – with rigid formatting requirements that dictate how the solution must be presented. The entanglement creates competing goals for the model, suggesting that more explicit separation of these two aspects could lead to improved performance. To this front, we introduce Deco-G, a decoding framework that explicitly decouples format adherence from task solving. Deco-G handles format compliance with a separate tractable probabilistic model (TPM), while prompts LLMs with only task instructions. At each decoding step, Deco-G combines next token probabilities from the LLM with the TPM calculated format compliance likelihood to form the output probability. To make this approach both practical and scalable for modern instruction-tuned LLMs, we introduce three key innovations: instruction-aware distillation, a flexible trie-building algorithm, and HMM state pruning for computational efficiency. We demonstrate the effectiveness of Deco-G across a wide range of tasks with diverse format requirements, including mathematical reasoning, LLM-as-a-judge, and event argument extraction. Overall, our approach yields 1.0% to 6.0% relative gain over regular prompting practice with guaranteed format compliance.

[21] Can an LLM Induce a Graph? Investigating Memory Drift and Context Length

Raquib Bin Yousuf, Aadyant Khatri, Shengzhe Xu, Mandar Sharma, Naren Ramakrishnan

Main category: cs.CL

TL;DR: LLMs exhibit memory drift and contextual forgetting at shorter effective lengths in complex relational reasoning tasks compared to existing benchmarks, revealing limitations in abstracting structured knowledge from unstructured text.

DetailsMotivation: Current evaluation benchmarks for LLMs rely on simplistic retrieval tasks that don't accurately reflect performance in information-dense scenarios requiring complex reasoning.

Method: Evaluate LLMs on complex reasoning tasks requiring induction of structured relational knowledge (graphs) from noisy natural language content with long contexts and irrelevant information.

Result: LLMs show memory drift and contextual forgetting at much shorter effective lengths than existing benchmarks suggest, even reasoning-specialized models like OpenAI o1 remain vulnerable.

Conclusion: Significant limitations exist in LLMs’ ability to abstract structured knowledge from unstructured input, highlighting need for architectural adaptations to improve long-range reasoning.

Abstract: Recently proposed evaluation benchmarks aim to characterize the effective context length and the forgetting tendencies of large language models (LLMs). However, these benchmarks often rely on simplistic ’needle in a haystack' retrieval or continuation tasks that may not accurately reflect the performance of these models in information-dense scenarios. Thus, rather than simple next token prediction, we argue for evaluating these models on more complex reasoning tasks that requires them to induce structured relational knowledge from the text - such as graphs from potentially noisy natural language content. While the input text can be viewed as generated in terms of a graph, its structure is not made explicit and connections must be induced from distributed textual cues, separated by long contexts and interspersed with irrelevant information. Our findings reveal that LLMs begin to exhibit memory drift and contextual forgetting at much shorter effective lengths when tasked with this form of relational reasoning, compared to what existing benchmarks suggest. With these findings, we offer recommendations for the optimal use of popular LLMs for complex reasoning tasks. We further show that even models specialized for reasoning, such as OpenAI o1, remain vulnerable to early memory drift in these settings. These results point to significant limitations in the models' ability to abstract structured knowledge from unstructured input and highlight the need for architectural adaptations to improve long-range reasoning.

[22] Towards Unsupervised Speech Recognition at the Syllable-Level

Liming Wang, Junrui Ni, Kai-Wei Chang, Saurabhchand Bhati, David Harwath, Mark Hasegawa-Johnson, James R. Glass

Main category: cs.CL

TL;DR: A syllable-level unsupervised speech recognition framework using masked language modeling that eliminates the need for G2P converters and addresses training instability in GAN-based methods, achieving significant CER reduction and better generalization to challenging languages like Mandarin.

DetailsMotivation: To enable speech recognition for low-resource languages and multimodal learning from non-parallel data by overcoming limitations of existing phone-based approaches that rely on costly G2P converters and struggle with ambiguous phoneme boundaries.

Method: Syllable-level unsupervised speech recognition framework based on masked language modeling, avoiding G2P dependency and GAN-based training instability.

Result: Achieves up to 40% relative reduction in character error rate on LibriSpeech and effectively generalizes to Mandarin, which was particularly difficult for prior methods.

Conclusion: The syllable-level masked language modeling approach provides a more stable and effective solution for unsupervised speech recognition, eliminating resource dependencies and improving performance on challenging languages.

Abstract: Training speech recognizers with unpaired speech and text – known as unsupervised speech recognition (UASR) – is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.

[23] UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Xiangyu Peng, Cab Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Chien-Sheng Wu

Main category: cs.CL

TL;DR: UniDoc-Bench is the first large-scale, realistic benchmark for multimodal retrieval-augmented generation (MM-RAG) built from 70k real-world PDF pages across eight domains, featuring 1,600 multimodal QA pairs and supporting apples-to-apples comparison across four retrieval paradigms.

DetailsMotivation: Current MM-RAG evaluations are fragmented, focusing on either text or images in isolation or on simplified multimodal setups that fail to capture document-centric multimodal use cases.

Method: Developed a pipeline that extracts and links evidence from text, tables, and figures from real-world PDFs, then generates multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries, with 20% validation by multiple annotators and expert adjudication.

Result: Multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate.

Conclusion: The benchmark reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.

Abstract: Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented, focusing on either text or images in isolation or on simplified multimodal setups that fail to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages across eight domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates 1,600 multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, 20% of QA pairs are validated by multiple annotators and expert adjudication. UniDoc-Bench supports apples-to-apples comparison across four paradigms: (1) text-only, (2) image-only, (3) multimodal text-image fusion, and (4) multimodal joint retrieval – under a unified protocol with standardized candidate pools, prompts, and evaluation metrics. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.

[24] Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text

Nisar Hussain, Amna Qasim, Gull Mehak, Muhammad Zain, Momina Hafeez, Grigori Sidorov

Main category: cs.CL

TL;DR: Proposes QLoRA-based fine-tuning framework for offensive language detection in Roman Urdu-English code-mixed text, achieving best results with Meta LLaMA 3 8B (91.45 F1 score).

DetailsMotivation: Challenges in processing code-mixed languages like Roman Urdu due to unstated grammar, inconsistent spelling, and scarcity of labeled data for offensive language detection.

Method: Translated Roman Urdu-English dataset to English using Google Translate, then fine-tuned multiple LLMs (LLaMA 3 8B, Mistral 7B, etc.) with QLoRA for memory-efficient adaptation on offensive vs non-offensive classification.

Result: Meta LLaMA 3 8B achieved highest F1 score of 91.45, followed by Mistral 7B at 89.66, both surpassing traditional transformer baselines.

Conclusion: QLoRA enables effective fine-tuning of LLMs for low-resource code-mixed offensive language detection, advancing scalable Roman Urdu moderation and multilingual offensive detection systems.

Abstract: The use of derogatory terms in languages that employ code mixing, such as Roman Urdu, presents challenges for Natural Language Processing systems due to unstated grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA based fine tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code mixing features. Our focus is on classification performance using English translated low resource inputs. We fine tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs non offensive content. Of all tested models, the highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA in fine tuning high performing models for low resource environments such as code mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive detection systems based on LLMs.

[25] StressTest: Can YOUR Speech LM Handle the Stress?

Iddo Yosha, Gallil Maimon, Yossi Adi

Main category: cs.CL

TL;DR: StressTest benchmark evaluates speech-aware language models’ ability to understand sentence stress patterns that change meaning, showing current models perform poorly despite general capabilities.

DetailsMotivation: Sentence stress plays a crucial role in conveying meaning and intent in speech, but this aspect is largely overlooked in the evaluation and development of speech-aware language models (SLMs).

Method: Introduced StressTest benchmark and created Stress-17k training set using a novel data generation pipeline that simulates meaning changes through stress variation. Developed StresSLM by finetuning models on this dataset.

Result: Existing SLMs perform poorly on sentence stress tasks despite overall capabilities. StresSLM notably outperforms existing models on both sentence stress reasoning and detection, and generalizes well to real recordings.

Conclusion: Sentence stress understanding is a critical but underdeveloped capability in SLMs. The proposed approach successfully addresses this gap and demonstrates significant improvements in stress-based meaning comprehension.

Abstract: Sentence stress refers to emphasis on words within a spoken utterance to highlight or contrast an idea. It is often used to imply an underlying intention not explicitly stated. Recent speech-aware language models (SLMs) have enabled direct audio processing, allowing models to access the full richness of speech to perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and intent, it remains largely overlooked in evaluation and development of SLMs. We address this gap by introducing StressTest, a benchmark designed to evaluate models’ ability to distinguish between meanings of speech based on the stress pattern. We evaluate leading SLMs, and find that despite their overall capabilities, they perform poorly on such tasks. Hence, we propose a novel data generation pipeline, and create Stress-17k, a training set that simulates change of meaning implied by stress variation. Results suggest, that our finetuned model, StresSLM, generalizes well to real recordings and notably outperforms existing SLMs on sentence stress reasoning and detection. Models, code, data, samples - pages.cs.huji.ac.il/adiyoss-lab/stresstest.

[26] Fun-ASR Technical Report

Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou

Main category: cs.CL

TL;DR: Fun-ASR is a large-scale LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance while addressing LLM hallucination issues for practical deployment.

DetailsMotivation: To address the problem of LLM hallucination in ASR systems that degrades user experience in real-world applications, and to bridge the performance gap between open-source benchmarks and real industry evaluation sets.

Method: Synergistically combines massive data scaling, large model capacity, LLM integration, and reinforcement learning. Specifically optimized for practical deployment with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world requirements.

Result: Achieves state-of-the-art performance on real application datasets, demonstrating effectiveness and robustness in practical settings, while outperforming other LLM-based ASR systems on industry evaluation sets.

Conclusion: Fun-ASR successfully addresses LLM hallucination issues and delivers superior performance in real-world ASR applications through production-oriented optimizations, making it suitable for practical deployment across diverse and complex speech recognition scenarios.

Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.

[27] MedReflect: Teaching Medical LLMs to Self-Improve via Reflective Correction

Yue Huang, Yanyuan Chen, Dexuan Xu, Weihua Yue, Huamin Zhang, Meikang Qiu, Yu Huang

Main category: cs.CL

TL;DR: MedReflect is a framework that enables LLMs to perform medical problem-solving through self-reflection without external retrieval or heavy annotation, achieving improved accuracy with minimal training data.

DetailsMotivation: Existing approaches for medical problem-solving with LLMs rely on external knowledge verification or reasoning datasets, which suffer from retrieval overhead, high annotation costs, and limited performance.

Method: MedReflect generates a single-pass reflection chain including initial hypothesis generation, self-questioning, self-answering, and decision refinement, enabling self-verified and self-reflective reasoning.

Result: With only 2,000 training examples and light fine-tuning, MedReflect achieves notable accuracy improvements across medical benchmarks while significantly reducing annotation requirements.

Conclusion: LLMs can learn to solve specialized medical problems through self-reflection and self-improvement, reducing reliance on external supervision and extensive task-specific fine-tuning data.

Abstract: Medical problem solving demands expert knowledge and intricate reasoning. Recent studies of large language models (LLMs) attempt to ease this complexity by introducing external knowledge verification through retrieval-augmented generation or by training on reasoning datasets. However, these approaches suffer from drawbacks such as retrieval overhead and high annotation costs, and they heavily rely on substituted external assistants to reach limited performance in medical field. In this paper, we introduce MedReflect, a generalizable framework designed to inspire LLMs with a physician-like reflective thinking mode. MedReflect generates a single-pass reflection chain that includes initial hypothesis generation, self-questioning, self-answering and decision refinement. This self-verified and self-reflective nature releases large language model’s latent capability in medical problem-solving without external retrieval or heavy annotation. We demonstrate that MedReflect enables cost-efficient medical dataset construction: with merely 2,000 randomly sampled training examples and a light fine-tuning, this approach achieves notable absolute accuracy improvements across a series of medical benchmarks while cutting annotation requirements. Our results provide evidence that LLMs can learn to solve specialized medical problems via self-reflection and self-improve, reducing reliance on external supervision and extensive task-specific fine-tuning data.

[28] HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition

Gio Paik, Yongbeom Kim, Soungmin Lee, Sangmin Ahn, Chanwoo Kim

Main category: cs.CL

TL;DR: HiKE is the first Korean-English code-switching benchmark providing natural CS data with hierarchical labeling for systematic evaluation of multilingual ASR models.

DetailsMotivation: Code-switching remains an underexplored challenge in multilingual ASR despite being common in daily speech.

Method: Created HiKE benchmark with high-quality natural CS data, loanword labels, and hierarchical CS-level labeling (word, phrase, sentence).

Result: Most multilingual ASR models initially show inadequate CS-ASR performance but can be improved through fine-tuning with synthetic CS data.

Conclusion: HiKE enables systematic evaluation of code-switching capabilities in ASR models and demonstrates that CS performance can be enabled through targeted fine-tuning.

Abstract: Despite advances in multilingual automatic speech recognition (ASR), code-switching (CS), the mixing of languages within an utterance common in daily speech, remains a severely underexplored challenge. In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. The proposed framework not only consists of high-quality, natural CS data across various topics, but also provides meticulous loanword labels and a hierarchical CS-level labeling scheme (word, phrase, and sentence) that together enable a systematic evaluation of a model’s ability to handle each distinct level of code-switching. Through evaluations of diverse multilingual ASR models and fine-tuning experiments, this paper demonstrates that although most multilingual ASR models initially exhibit inadequate CS-ASR performance, this capability can be enabled through fine-tuning with synthetic CS data. HiKE is available at https://github.com/ThetaOne-AI/HiKE

[29] TreePrompt: Leveraging Hierarchical Few-Shot Example Selection for Improved English-Persian and English-German Translation

Ramtin Kakavand, Ebrahim Ansari

Main category: cs.CL

TL;DR: TreePrompt is a novel example selection method that learns LLM preferences to identify high-quality, contextually relevant examples for few-shot machine translation, improving translation performance when combined with AFSP or Random selection.

DetailsMotivation: Existing example selection methods for few-shot prompting in machine translation focus only on query-to-example similarity and ignore example quality, limiting translation performance.

Method: Proposed TreePrompt approach learns LLM preferences to select high-quality, contextually relevant examples within a tree-structured framework, and combines it with K-NN and Adaptive Few-Shot Prompting (AFSP).

Result: Evaluations on English-Persian (MIZAN) and English-German (WMT19) show that integrating TreePrompt with AFSP or Random selection leads to improved translation performance.

Conclusion: TreePrompt effectively balances similarity and quality in example selection, enhancing few-shot machine translation performance when combined with appropriate selection strategies.

Abstract: Large Language Models (LLMs) have consistently demonstrated strong performance in machine translation, especially when guided by high-quality prompts. Few-shot prompting is an effective technique to improve translation quality; however, most existing example selection methods focus solely on query-to-example similarity and do not account for the quality of the examples. In this work, we propose TreePrompt, a novel example selection approach that learns LLM preferences to identify high-quality, contextually relevant examples within a tree-structured framework. To further explore the balance between similarity and quality, we combine TreePrompt with K-Nearest Neighbors (K-NN) and Adaptive Few-Shot Prompting (AFSP). Evaluations on two language pairs - English-Persian (MIZAN) and English-German (WMT19) - show that integrating TreePrompt with AFSP or Random selection leads to improved translation performance.

[30] Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs

Deshan Sumanathilaka, Nicholas Micallef, Julian Hough

Main category: cs.CL

TL;DR: This study examines how imbalanced few-shot examples in prompting affect Word Sense Disambiguation across multiple languages, finding that multilingual models are sensitive to sample distribution while English models are not.

DetailsMotivation: To investigate the impact of few-shot prompting strategies on Word Sense Disambiguation, particularly focusing on biases from imbalanced sample distributions in multilingual contexts.

Method: Used GLOSSGPT prompting method for English WSD and tested across five languages (English, German, Spanish, French, Italian) with GPT-4o and LLaMA-3.1-70B models.

Result: Imbalanced few-shot examples cause incorrect sense predictions in multilingual languages but not in English. Both GPT-4o and LLaMA-3.1-70B show sensitivity to sample distribution in multilingual WSD.

Conclusion: Multilingual Word Sense Disambiguation is highly sensitive to sample distribution in few-shot settings, emphasizing the need for balanced and representative prompting strategies.

Abstract: Recent advances in Large Language Models (LLMs) have significantly reshaped the landscape of Natural Language Processing (NLP). Among the various prompting techniques, few-shot prompting has gained considerable attention for its practicality and effectiveness. This study investigates how few-shot prompting strategies impact the Word Sense Disambiguation (WSD) task, particularly focusing on the biases introduced by imbalanced sample distributions. We use the GLOSSGPT prompting method, an advanced approach for English WSD, to test its effectiveness across five languages: English, German, Spanish, French, and Italian. Our results show that imbalanced few-shot examples can cause incorrect sense predictions in multilingual languages, but this issue does not appear in English. To assess model behavior, we evaluate both the GPT-4o and LLaMA-3.1-70B models and the results highlight the sensitivity of multilingual WSD to sample distribution in few-shot settings, emphasizing the need for balanced and representative prompting strategies.

[31] Rezwan: Leveraging Large Language Models for Comprehensive Hadith Text Processing: A 1.2M Corpus Development

Majid Asgari-Bidhendi, Muhammad Amin Ghaseminia, Alireza Shahbazi, Sayyed Ali Hossayni, Najmeh Torabian, Behrouz Minaei-Bidgoli

Main category: cs.CL

TL;DR: Rezwan is a large-scale AI-assisted Hadith corpus with 1.2M narrations created through an automated pipeline using LLMs for segmentation, validation, and multi-layer enrichment including translation, diacritization, summarization, and semantic analysis.

DetailsMotivation: To create a research-ready infrastructure for digital humanities and Islamic studies by transforming raw Hadith texts into a richly annotated, multilingual corpus using AI to overcome the limitations of manual curation.

Method: Automated pipeline using LLMs for segmentation, chain-text separation, validation, and multi-layer enrichment including machine translation into 12 languages, diacritization, abstractive summarization, thematic tagging, and cross-text semantic analysis.

Result: Near-human accuracy in structured tasks (9.33/10 for chain-text separation and summarization), superior to manually curated Noor Corpus (8.46/10 vs 3.66/10), with economic feasibility - tasks requiring 229,000+ expert hours completed within months at fraction of cost.

Conclusion: AI can augment human expertise to enable large-scale, multilingual, semantically enriched access to Islamic heritage, introducing a new paradigm in religious text processing.

Abstract: This paper presents the development of Rezwan, a large-scale AI-assisted Hadith corpus comprising over 1.2M narrations, extracted and structured through a fully automated pipeline. Building on digital repositories such as Maktabat Ahl al-Bayt, the pipeline employs Large Language Models (LLMs) for segmentation, chain–text separation, validation, and multi-layer enrichment. Each narration is enhanced with machine translation into twelve languages, intelligent diacritization, abstractive summarization, thematic tagging, and cross-text semantic analysis. This multi-step process transforms raw text into a richly annotated research-ready infrastructure for digital humanities and Islamic studies. A rigorous evaluation was conducted on 1,213 randomly sampled narrations, assessed by six domain experts. Results show near-human accuracy in structured tasks such as chain–text separation (9.33/10) and summarization (9.33/10), while highlighting ongoing challenges in diacritization and semantic similarity detection. Comparative analysis against the manually curated Noor Corpus demonstrates the superiority of Najm in both scale and quality, with a mean overall score of 8.46/10 versus 3.66/10. Furthermore, cost analysis confirms the economic feasibility of the AI approach: tasks requiring over 229,000 hours of expert labor were completed within months at a fraction of the cost. The work introduces a new paradigm in religious text processing by showing how AI can augment human expertise, enabling large-scale, multilingual, and semantically enriched access to Islamic heritage.

[32] Mechanistic Interpretability of Socio-Political Frames in Language Models

Hadi Asghari, Sami Nenno

Main category: cs.CL

TL;DR: LLMs can generate and recognize deep cognitive frames like ‘strict father’ and ’nurturing parent’ in socio-political contexts, with specific dimensions in hidden representations correlating with these frames.

DetailsMotivation: To understand how large language models capture and express meaningful human concepts, particularly deep cognitive frames in socio-political contexts.

Method: Used mechanistic interpretability to investigate frame locations in model’s hidden representations, identifying singular dimensions that strongly correlate with specific frames in zero-shot settings.

Result: LLMs demonstrate high fluency in generating texts that evoke specific frames and can recognize these frames effectively in zero-shot scenarios.

Conclusion: The research contributes to understanding how LLMs internalize and express meaningful human concepts through identifiable dimensions in their representations.

Abstract: This paper explores the ability of large language models to generate and recognize deep cognitive frames, particularly in socio-political contexts. We demonstrate that LLMs are highly fluent in generating texts that evoke specific frames and can recognize these frames in zero-shot settings. Inspired by mechanistic interpretability research, we investigate the location of the strict father' and nurturing parent’ frames within the model’s hidden representation, identifying singular dimensions that correlate strongly with their presence. Our findings contribute to understanding how LLMs capture and express meaningful human concepts.

[33] Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models

Canhui Wu, Qiong Cao, Chang Li, Zhenfang Wang, Chao Xue, Yuwei Fan, Wei Xi, Xiaodong He

Main category: cs.CL

TL;DR: Step Pruner (SP) is an RL framework that addresses overthinking in Large Reasoning Models by penalizing redundant reasoning steps rather than tokens, preventing hacking behaviors while maintaining accuracy.

DetailsMotivation: Existing RL methods penalize tokens to reduce verbosity, but this doesn't address reasoning step efficiency and can lead to hacking behaviors where models discard reasoning steps to minimize token usage.

Method: Step Pruner uses a step-aware reward function that prioritizes correctness while penalizing redundant steps, withholds rewards for incorrect responses, and implements dynamic stopping when step length exceeds limits to prevent step merging.

Result: SP achieves state-of-the-art accuracy while significantly reducing response length, with 69.7% token reduction on AIME24 benchmark across four reasoning benchmarks.

Conclusion: The step-level approach effectively addresses overthinking in LRMs by focusing on reasoning step efficiency rather than token count, preventing hacking behaviors while maintaining high accuracy.

Abstract: Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as “overthinking.” Existing solutions via reinforcement learning (RL) typically penalize generated tokens to promote conciseness. However, these methods encounter two challenges: responses with fewer tokens do not always correspond to fewer reasoning steps, and models may develop hacking behavior in later stages of training by discarding reasoning steps to minimize token usage. In this work, we introduce \textbf{Step Pruner (SP)}, an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when the length of any output step exceeds the upper limit, we halt updates to prevent hacking behavior caused by merging steps. Extensive experiments across four reasoning benchmarks demonstrate that SP achieves state-of-the-art accuracy while significantly reducing response length. For instance, on AIME24, SP reduces token usage by \textbf{69.7%}.

[34] Annotate Rhetorical Relations with INCEpTION: A Comparison with Automatic Approaches

Mehedi Hasan Emon

Main category: cs.CL

TL;DR: Comparison of manual vs automatic rhetorical relation annotation using INCEpTION tool and LLMs (BERT, DistilBERT, Logistic Regression) on cricket news, with DistilBERT achieving best performance.

DetailsMotivation: To explore the intersection of discourse parsing and transformer-based NLP by comparing manual annotation with automatic approaches for rhetorical relation classification.

Method: Used INCEpTION tool for manual annotation and evaluated BERT, DistilBERT, and Logistic Regression models on classifying rhetorical relations (elaboration, contrast, background, cause-effect) in cricket news reports.

Result: DistilBERT achieved the highest accuracy among the models tested, demonstrating strong potential for efficient discourse relation prediction.

Conclusion: This work successfully bridges discourse parsing with transformer-based NLP, showing that DistilBERT is particularly effective for rhetorical relation classification tasks.

Abstract: This research explores the annotation of rhetorical relations in discourse using the INCEpTION tool and compares manual annotation with automatic approaches based on large language models. The study focuses on sports reports (specifically cricket news) and evaluates the performance of BERT, DistilBERT, and Logistic Regression models in classifying rhetorical relations such as elaboration, contrast, background, and cause-effect. The results show that DistilBERT achieved the highest accuracy, highlighting its potential for efficient discourse relation prediction. This work contributes to the growing intersection of discourse parsing and transformer-based NLP. (This paper was conducted as part of an academic requirement under the supervision of Prof. Dr. Ralf Klabunde, Linguistic Data Science Lab, Ruhr University Bochum.) Keywords: Rhetorical Structure Theory, INCEpTION, BERT, DistilBERT, Discourse Parsing, NLP.

[35] Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles

Nusrat Jahan Lia, Shubhashis Roy Dipta, Abdullah Khan Zehady, Naymul Islam, Madhusodan Chakraborty, Abdullah Al Wasif

Main category: cs.CL

TL;DR: First benchmark dataset for Bangla political bias detection with 200 news articles labeled for government-leaning, government-critique, and neutral stances, plus diagnostic analyses for evaluating LLMs.

DetailsMotivation: Addressing the scarcity of annotated datasets and computational studies for Bangla political bias research, which requires understanding linguistic cues, cultural context, and socio-political background.

Method: Created a labeled dataset of 200 politically significant Bangla news articles and conducted comprehensive evaluation of 28 proprietary and open-source LLMs on stance detection tasks.

Result: LLMs showed strong performance in detecting government-critique content (F1 up to 0.83) but substantial difficulty with neutral articles (F1 as low as 0.00), with models tending to over-predict government-leaning stances.

Conclusion: The dataset and diagnostics provide a foundation for advancing stance detection in Bangla media research and offer insights for improving LLM performance in low-resource languages.

Abstract: Detecting media bias is crucial, specifically in the South Asian region. Despite this, annotated datasets and computational studies for Bangla political bias research remain scarce. Crucially because, political stance detection in Bangla news requires understanding of linguistic cues, cultural context, subtle biases, rhetorical strategies, code-switching, implicit sentiment, and socio-political background. To address this, we introduce the first benchmark dataset of 200 politically significant and highly debated Bangla news articles, labeled for government-leaning, government-critique, and neutral stances, alongside diagnostic analyses for evaluating large language models (LLMs). Our comprehensive evaluation of 28 proprietary and open-source LLMs shows strong performance in detecting government-critique content (F1 up to 0.83) but substantial difficulty with neutral articles (F1 as low as 0.00). Models also tend to over-predict government-leaning stances, often misinterpreting ambiguous narratives. This dataset and its associated diagnostics provide a foundation for advancing stance detection in Bangla media research and offer insights for improving LLM performance in low-resource languages.

[36] PsycholexTherapy: Simulating Reasoning in Psychotherapy with Small Language Models in Persian

Mohammad Amin Abbasi, Hassan Naderi

Main category: cs.CL

TL;DR: PsychoLexTherapy is a framework for simulating psychotherapeutic reasoning in Persian using small language models, optimized for on-device deployment with structured memory for multi-turn interactions.

DetailsMotivation: To develop culturally grounded, therapeutically coherent dialogue systems for underrepresented languages like Persian while ensuring privacy and feasibility through on-device deployment.

Method: Three-stage process: (i) assessing SLMs’ psychological knowledge with PsychoLexEval, (ii) designing PsychoLexTherapy framework with structured memory, (iii) constructing evaluation datasets (PsychoLexQuery and PsychoLexDialogue) and comparing prompting strategies.

Result: Outperformed all baselines in automatic evaluation and human preference studies. Long-term memory module proved essential for multi-turn coherence. Achieved highest ratings in empathy, coherence, cultural fit, and personalization.

Conclusion: Establishes a practical, privacy-preserving, and culturally aligned foundation for Persian psychotherapy simulation with novel datasets and insights into structured memory for therapeutic reasoning.

Abstract: This study presents PsychoLexTherapy, a framework for simulating psychotherapeutic reasoning in Persian using small language models (SLMs). The framework tackles the challenge of developing culturally grounded, therapeutically coherent dialogue systems with structured memory for multi-turn interactions in underrepresented languages. To ensure privacy and feasibility, PsychoLexTherapy is optimized for on-device deployment, enabling use without external servers. Development followed a three-stage process: (i) assessing SLMs psychological knowledge with PsychoLexEval; (ii) designing and implementing the reasoning-oriented PsychoLexTherapy framework; and (iii) constructing two evaluation datasets-PsychoLexQuery (real Persian user questions) and PsychoLexDialogue (hybrid simulated sessions)-to benchmark against multiple baselines. Experiments compared simple prompting, multi-agent debate, and structured therapeutic reasoning paths. Results showed that deliberate model selection balanced accuracy, efficiency, and privacy. On PsychoLexQuery, PsychoLexTherapy outperformed all baselines in automatic LLM-as-a-judge evaluation and was ranked highest by human evaluators in a single-turn preference study. In multi-turn tests with PsychoLexDialogue, the long-term memory module proved essential: while naive history concatenation caused incoherence and information loss, the full framework achieved the highest ratings in empathy, coherence, cultural fit, and personalization. Overall, PsychoLexTherapy establishes a practical, privacy-preserving, and culturally aligned foundation for Persian psychotherapy simulation, contributing novel datasets, a reproducible evaluation pipeline, and empirical insights into structured memory for therapeutic reasoning.

[37] Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs

Junjie Luo, Rui Han, Arshana Welivita, Zeleikun Di, Jingfu Wu, Xuzhe Zhi, Ritu Agarwal, Gordon Gao

Main category: cs.CL

TL;DR: LLM-based pipeline analyzes 4.1M patient reviews to infer physicians’ Big Five personality traits and patient perceptions, revealing systematic patterns in physician ratings and identifying four distinct physician archetypes.

DetailsMotivation: To understand how patients perceive physicians to improve trust, communication, and satisfaction in healthcare relationships.

Method: Large language model pipeline analyzing 4.1 million patient reviews of 226,999 U.S. physicians, validated through multi-model comparison and human expert benchmarking.

Result: Strong agreement between human and LLM assessments (correlation 0.72-0.89), systematic patterns showing male physicians receive higher ratings, empathy traits dominate in pediatrics/psychiatry, and identification of four physician archetypes from “Well-Rounded Excellent” to “Underperforming”.

Conclusion: Automated trait extraction from patient narratives provides validated metrics for understanding physician-patient relationships at scale, with implications for quality measurement, bias detection, and healthcare workforce development.

Abstract: Understanding how patients perceive their physicians is essential to improving trust, communication, and satisfaction. We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient-oriented subjective judgments. The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million. We validate the method through multi-model comparison and human expert benchmarking, achieving strong agreement between human and LLM assessments (correlation coefficients 0.72-0.89) and external validity through correlations with patient satisfaction (r = 0.41-0.81, all p<0.001). National-scale analysis reveals systematic patterns: male physicians receive higher ratings across all traits, with largest disparities in clinical competence perceptions; empathy-related traits predominate in pediatrics and psychiatry; and all traits positively predict overall satisfaction. Cluster analysis identifies four distinct physician archetypes, from “Well-Rounded Excellent” (33.8%, uniformly high traits) to “Underperforming” (22.6%, consistently low). These findings demonstrate that automated trait extraction from patient narratives can provide interpretable, validated metrics for understanding physician-patient relationships at scale, with implications for quality measurement, bias detection, and workforce development in healthcare.

[38] Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions

Yang Xu, Xuanming Zhang, Min-Hsuan Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Yixuan Li

Main category: cs.CL

TL;DR: First simulation framework to evaluate LLM deception in multi-turn interactions, revealing deception increases with pressure and erodes trust across 11 models.

DetailsMotivation: Current LLM deception evaluations are limited to single-turn prompts, failing to capture long-horizon interactions where deceptive strategies typically unfold.

Method: Multi-agent simulation with performer agent completing tasks, supervisor agent evaluating progress and trust, and independent deception auditor reviewing full trajectories.

Result: Deception is model-dependent, increases with event pressure, consistently erodes supervisor trust, and reveals distinct strategies of concealment, equivocation, and falsification.

Conclusion: Deception is an emergent risk in long-horizon LLM interactions, providing foundation for evaluating future models in trust-sensitive contexts.

Abstract: Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception under pressure, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce the first simulation framework for probing and evaluating deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. Our framework instantiates a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed- and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal distinct strategies of concealment, equivocation, and falsification. Our findings establish deception as an emergent risk in long-horizon interactions and provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.

[39] Named Entity Recognition in COVID-19 tweets with Entity Knowledge Augmentation

Xuankang Zhang, Jiangming Liu

Main category: cs.CL

TL;DR: A novel entity knowledge augmentation approach for COVID-19 named entity recognition that improves performance in both fully-supervised and few-shot settings on social media and biomedical datasets.

DetailsMotivation: COVID-19 pandemic discussions on social media need named entity recognition, but existing methods face challenges due to informal text, limited annotations, and domain-specific knowledge requirements.

Method: Proposed entity knowledge augmentation approach that can be applied to both informal (social media) and formal (biomedical) text formats for named entity recognition.

Result: Experiments on COVID-19 tweets dataset and PubMed dataset show improved NER performance in both fully-supervised and few-shot settings.

Conclusion: The entity knowledge augmentation approach effectively addresses challenges in COVID-19 NER and can be generalized to biomedical named entity recognition tasks.

Abstract: The COVID-19 pandemic causes severe social and economic disruption around the world, raising various subjects that are discussed over social media. Identifying pandemic-related named entities as expressed on social media is fundamental and important to understand the discussions about the pandemic. However, there is limited work on named entity recognition on this topic due to the following challenges: 1) COVID-19 texts in social media are informal and their annotations are rare and insufficient to train a robust recognition model, and 2) named entity recognition in COVID-19 requires extensive domain-specific knowledge. To address these issues, we propose a novel entity knowledge augmentation approach for COVID-19, which can also be applied in general biomedical named entity recognition in both informal text format and formal text format. Experiments carried out on the COVID-19 tweets dataset and PubMed dataset show that our proposed entity knowledge augmentation improves NER performance in both fully-supervised and few-shot settings. Our source code is publicly available: https://github.com/kkkenshi/LLM-EKA/tree/master

[40] AgriGPT-VL: Agricultural Vision-Language Understanding Suite

Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, Shijian Li

Main category: cs.CL

TL;DR: AgriGPT-VL Suite is a unified multimodal framework for agriculture featuring the largest vision-language corpus (Agri-3M-VL), a specialized vision-language model (AgriGPT-VL), and a challenging evaluation suite (AgriBench-VL-4K).

DetailsMotivation: Address the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation in agricultural AI applications.

Method: Developed Agri-3M-VL corpus using scalable multi-agent data generator, trained AgriGPT-VL via progressive curriculum of textual grounding, multimodal alignment, and GRPO refinement, and established AgriBench-VL-4K evaluation suite with LLM-as-a-judge framework.

Result: AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K with higher pairwise win rates, while remaining competitive on text-only AgriBench-13K without language ability degradation.

Conclusion: The framework demonstrates strong multimodal reasoning capabilities for agriculture while preserving text-only performance, with ablation studies confirming consistent gains from alignment and GRPO refinement stages.

Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open source all of the resources to support reproducible research and deployment in low-resource agricultural settings.

[41] LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

Jiarui Liu, Jivitesh Jain, Mona Diab, Nishant Subramani

Main category: cs.CL

TL;DR: Using model activations to predict LLM output correctness and context effectiveness, achieving 75% accuracy for early auditing and better context quality assessment than prompting methods.

DetailsMotivation: LLMs often generate incorrect information with high confidence, and there's a need to identify when queries benefit from retrieved context and assess context effectiveness to improve trustworthiness.

Method: Operationalize interpretability methods to predict correctness from model activations alone. Use simple classifier on intermediate layer activations of first output token. Introduce metrics to distinguish correct, incorrect, and irrelevant context.

Result: Classifier achieves ~75% accuracy predicting output correctness from activations. Model-internals-based metric significantly outperforms prompting baselines at distinguishing correct vs incorrect context.

Conclusion: Model internals contain signals about output correctness and context efficacy, offering insights into LLM decision-making processes and enabling better trustworthiness assessment.

Abstract: Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remains challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model’s activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Experiments on six different models reveal that a simple classifier trained on intermediate layer activations of the first output token can predict output correctness with about 75% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying decision-making processes of LLMs. Our code is publicly available at https://github.com/jiarui-liu/LLM-Microscope

[42] Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

Thanapol Popit, Natthapath Rungseesiripak, Monthol Charattrakool, Saksorn Ruangtanusak

Main category: cs.CL

TL;DR: First systematic study of Thai text-only end-of-turn detection for real-time agents, comparing zero-shot/few-shot LLM prompting with supervised fine-tuning of lightweight transformers.

DetailsMotivation: Traditional audio-silence end-pointers add significant delay and fail under hesitations or language-specific phenomena, requiring more reliable and low-latency detection for fluid voice-to-voice interaction.

Method: Used transcribed subtitles from YODAS corpus with Thai-specific linguistic cues (sentence-final particles), formulated EOT as binary decision over token boundaries, compared zero-shot/few-shot prompting of compact LLMs with supervised fine-tuning of lightweight transformers.

Result: Clear accuracy-latency tradeoff observed, with small fine-tuned models capable of delivering near-instant EOT decisions suitable for on-device agents.

Conclusion: Established Thai baseline and demonstrated that small, fine-tuned models can provide near-instant end-of-turn detection for real-time voice agents.

Abstract: Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs to supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.

[43] Does Using Counterfactual Help LLMs Explain Textual Importance in Classification?

Nelvin Tan, James Asikin Cheung, Yu-Ching Shih, Dong Yang, Amol Salunkhe

Main category: cs.CL

TL;DR: A framework using counterfactuals to explain LLM classification decisions by measuring how word importance changes when counterfactuals are introduced.

DetailsMotivation: LLMs are increasingly used for classification but are black-boxed and expensive to call, creating need for efficient explanation methods.

Method: Proposed decision changing rate framework that quantifies word importance by incorporating counterfactuals into LLM reasoning.

Result: Experimental results demonstrate that using counterfactuals can be helpful for identifying important words in classification decisions.

Conclusion: Counterfactuals provide an effective approach for explaining LLM classification decisions under practical constraints.

Abstract: Large language models (LLMs) are becoming useful in many domains due to their impressive abilities that arise from large training datasets and large model sizes. More recently, they have been shown to be very effective in textual classification tasks, motivating the need to explain the LLMs’ decisions. Motivated by practical constrains where LLMs are black-boxed and LLM calls are expensive, we study how incorporating counterfactuals into LLM reasoning can affect the LLM’s ability to identify the top words that have contributed to its classification decision. To this end, we introduce a framework called the decision changing rate that helps us quantify the importance of the top words in classification. Our experimental results show that using counterfactuals can be helpful.

[44] Small Language Models for Emergency Departments Decision Support: A Benchmark Study

Zirui Wang, Jiajun Wu, Braden Teitge, Jessalyn Holodinsky, Steve Drew

Main category: cs.CL

TL;DR: Small language models (SLMs) show strong potential for emergency department decision support, with general-domain SLMs surprisingly outperforming medically fine-tuned models across multiple medical benchmarks.

DetailsMotivation: Emergency departments require efficient AI tools due to fast-paced, high-stakes environments. SLMs offer practical advantages over LLMs for real-world deployment due to hardware limitations, cost constraints, and privacy concerns.

Method: Comprehensive benchmark using MedMCQA, MedQA-4Options, and PubMedQA datasets to evaluate SLMs trained on mixed general-domain and medical corpora, with medical abstracts simulating real ED physician tasks.

Result: General-domain SLMs outperformed medically fine-tuned counterparts across all benchmarks, suggesting specialized medical fine-tuning may not be necessary for ED applications.

Conclusion: For emergency department decision support, general-domain small language models are sufficient and specialized medical fine-tuning may not provide additional benefits.

Abstract: Large language models (LLMs) have become increasingly popular in medical domains to assist physicians with a variety of clinical and operational tasks. Given the fast-paced and high-stakes environment of emergency departments (EDs), small language models (SLMs), characterized by a reduction in parameter count compared to LLMs, offer significant potential due to their inherent reasoning capability and efficient performance. This enables SLMs to support physicians by providing timely and accurate information synthesis, thereby improving clinical decision-making and workflow efficiency. In this paper, we present a comprehensive benchmark designed to identify SLMs suited for ED decision support, taking into account both specialized medical expertise and broad general problem-solving capabilities. In our evaluations, we focus on SLMs that have been trained on a mixture of general-domain and medical corpora. A key motivation for emphasizing SLMs is the practical hardware limitations, operational cost constraints, and privacy concerns in the typical real-world deployments. Our benchmark datasets include MedMCQA, MedQA-4Options, and PubMedQA, with the medical abstracts dataset emulating tasks aligned with real ED physicians’ daily tasks. Experimental results reveal that general-domain SLMs surprisingly outperform their medically fine-tuned counterparts across these diverse benchmarks for ED. This indicates that for ED, specialized medical fine-tuning of the model may not be required.

[45] Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment

Yunfan Zhang, Kathleen McKeown, Smaranda Muresan

Main category: cs.CL

TL;DR: This paper investigates using Chain-of-Thought reasoning techniques to enable large language models to support steerable pluralism - the ability to adopt specific perspectives and align outputs accordingly.

DetailsMotivation: Current LLMs reflect uniform values, limiting their applicability for tasks requiring nuanced human perspectives. There's a need for models that can understand and adopt diverse viewpoints.

Method: Explored several CoT approaches: CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). Evaluated on Value Kaleidoscope and OpinionQA datasets.

Result: RLVR consistently outperformed other methods and demonstrated strong training sample efficiency. The generated CoT traces were analyzed for faithfulness and safety.

Conclusion: Chain-of-Thought reasoning techniques, particularly RLVR, can effectively build steerable pluralistic models that can adopt specific perspectives while maintaining faithfulness and safety.

Abstract: Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism – the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.

[46] What Makes Diffusion Language Models Super Data Learners?

Zitian Gao, Haoming Luo, Lynx Chen, Jason Klein Liu, Ran Tao, Joey Zhou, Bryan Dai

Main category: cs.CL

TL;DR: Random masking in diffusion language models is the key factor for data efficiency, and similar gains can be achieved through MLP dropout and weight decay.

DetailsMotivation: To understand why diffusion language models achieve remarkable data efficiency under limited-data constraints, as the underlying mechanisms remain unclear.

Method: Performed extensive ablation experiments to disentangle the sources of data efficiency in diffusion language models.

Result: Random masking of input tokens plays the dominant role in data efficiency, and similar gains can be obtained through MLP dropout and weight decay.

Conclusion: Stochastic regularization broadly enhances data efficiency in multi-epoch training for language models.

Abstract: Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through in MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at https://github.com/zitian-gao/data-efficiency.

[47] PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity

Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li

Main category: cs.CL

TL;DR: PoLi-RL is a novel Point-to-List Reinforcement Learning framework that achieves state-of-the-art performance on Conditional Semantic Textual Similarity (C-STS) by using a two-stage curriculum with hybrid rewards and parallel slice ranking.

DetailsMotivation: Existing C-STS methods are limited to discriminative models and fail to leverage recent advances in LLMs and RL. RL is well-suited for C-STS as it can directly optimize the non-differentiable Spearman metric and guide reasoning, but naive listwise RL fails due to complex reward signals.

Method: PoLi-RL uses a two-stage curriculum: first training with pointwise rewards for basic scoring, then transitioning to hybrid rewards combining pointwise, pairwise, and listwise objectives. It introduces Parallel Slice Ranking Reward (PSRR) that computes ranking rewards in parallel slices for granular credit assignment.

Result: PoLi-RL achieves Spearman correlation coefficient of 48.18 on the official C-STS benchmark, establishing new SOTA for cross-encoder architecture.

Conclusion: As the first successful application of RL to C-STS, PoLi-RL introduces a powerful paradigm for training LLMs on complex, ranking-based conditional judgment tasks with precise optimization.

Abstract: Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully integrate recent breakthroughs in the NLP community concerning Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. However, we find that naively applying listwise RL fails to produce meaningful improvements, as the model is overwhelmed by complex, coarse-grained reward signals. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model’s ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice comprises same-indexed completions from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful and precise paradigm for training LLMs on complex, ranking-based conditional judgment tasks.

[48] Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu

Main category: cs.CL

TL;DR: Caco is a code-assisted framework that automates the synthesis of high-quality, verifiable reasoning data through code-driven augmentation, enabling scalable reasoning without human intervention.

DetailsMotivation: Existing Chain-of-Thought methods suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths, while code-based approaches are constrained to predefined mathematical problems.

Method: Fine-tunes a code-based CoT generator on existing solutions in unified code format, then scales data generation with automated validation via code execution and rule-based filtering, followed by reverse-engineering into natural language instructions.

Result: Experiments on Caco-1.3M dataset show Caco-trained models achieve strong competitive performance on mathematical reasoning benchmarks, outperforming existing baselines with superior generalization across unseen tasks.

Conclusion: Caco establishes a paradigm for building self-sustaining, trustworthy reasoning systems through fully automated, scalable synthesis of reasoning data with guaranteed executability.

Abstract: Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales the data generation to a large amount of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve strong competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco’s code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.

[49] Unveiling LLMs’ Metaphorical Understanding: Exploring Conceptual Irrelevance, Context Leveraging and Syntactic Influence

Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong

Main category: cs.CL

TL;DR: LLMs struggle with metaphor comprehension, generating 15-25% irrelevant interpretations, relying on training data patterns rather than context, and showing sensitivity to syntax over meaning.

DetailsMotivation: To explore LLMs' metaphor-processing abilities since their mechanisms for metaphor comprehension remain insufficiently explored despite advanced capabilities in other areas.

Method: Three-pronged approach: (1) Concept Mapping using embedding space projections, (2) Metaphor-Literal Repository analysis of metaphorical words and literal counterparts, (3) Syntactic Sensitivity assessment of metaphorical structures.

Result: LLMs generate 15-25% conceptually irrelevant interpretations, depend on training data indicators rather than contextual cues, and are more sensitive to syntactic irregularities than structural comprehension.

Conclusion: LLMs have significant limitations in metaphor analysis, highlighting the need for more robust computational approaches to metaphor processing.

Abstract: Metaphor analysis is a complex linguistic phenomenon shaped by context and external factors. While Large Language Models (LLMs) demonstrate advanced capabilities in knowledge integration, contextual reasoning, and creative generation, their mechanisms for metaphor comprehension remain insufficiently explored. This study examines LLMs’ metaphor-processing abilities from three perspectives: (1) Concept Mapping: using embedding space projections to evaluate how LLMs map concepts in target domains (e.g., misinterpreting “fall in love” as “drop down from love”); (2) Metaphor-Literal Repository: analyzing metaphorical words and their literal counterparts to identify inherent metaphorical knowledge; and (3) Syntactic Sensitivity: assessing how metaphorical syntactic structures influence LLMs’ performance. Our findings reveal that LLMs generate 15%-25% conceptually irrelevant interpretations, depend on metaphorical indicators in training data rather than contextual cues, and are more sensitive to syntactic irregularities than to structural comprehension. These insights underline the limitations of LLMs in metaphor analysis and call for more robust computational approaches.

[50] Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy (v20251005)

Nuwan I. Senaratna

Main category: cs.CL

TL;DR: A collection of 215,670 multilingual documents (60.3 GB) from Sri Lankan parliamentary, legal, government, news, and tourism sources, updated daily and available on GitHub/Hugging Face.

DetailsMotivation: To provide open, machine-readable datasets to support research in computational linguistics, legal analytics, socio-political studies, and multilingual NLP for Sri Lankan languages.

Method: Created a data collection pipeline from various Sri Lankan sources including parliamentary proceedings, legal judgments, government publications, news, and tourism statistics in Sinhala, Tamil, and English.

Result: Successfully compiled 13 datasets totaling 215,670 documents (60.3 GB) as of v20251005, with daily updates and availability on multiple platforms.

Conclusion: These resources fill a gap in multilingual datasets for Sri Lankan languages and enable diverse research applications while addressing licensing and ethical considerations.

Abstract: We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. As of v20251005, the collection currently comprises 215,670 documents (60.3 GB) across 13 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations.

[51] Fine Tuning Methods for Low-resource Languages

Tim Bakkenes, Daniel Wang, Anton Johansson

Main category: cs.CL

TL;DR: This paper addresses the cultural bias in Large Language Models by developing a method to improve performance for underrepresented languages through culturally relevant dataset preparation and post-training of the Gemma 2 model.

DetailsMotivation: Large Language Models are predominantly trained on English texts and culture, causing them to underperform in other languages and cultural contexts, which excludes many cultures from benefiting from AI advancements.

Method: Developed a generalizable method for preparing culturally relevant datasets and applied post-training techniques to the Gemma 2 model to enhance its performance for underrepresented languages.

Result: The approach successfully increased Gemma 2’s performance for an underrepresented language, demonstrating the feasibility of adapting LLMs to diverse cultural contexts.

Conclusion: This method provides a replicable framework for others to unlock Generative AI’s potential in their countries while preserving cultural heritage, promoting more inclusive AI development.

Abstract: The rise of Large Language Models has not been inclusive of all cultures. The models are mostly trained on English texts and culture which makes them underperform in other languages and cultural contexts. By developing a generalizable method for preparing culturally relevant datasets and post-training the Gemma 2 model, this project aimed to increase the performance of Gemma 2 for an underrepresented language and showcase how others can do the same to unlock the power of Generative AI in their country and preserve their cultural heritage.

[52] Self Speculative Decoding for Diffusion Large Language Models

Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, Linfeng Zhang

Main category: cs.CL

TL;DR: SSD is a lossless inference acceleration method for diffusion-based LLMs that uses self-speculative decoding to achieve up to 3.46× speedup while maintaining identical output to stepwise decoding.

DetailsMotivation: Current parallel decoding methods in diffusion-based LLMs deviate from stepwise decoding, causing performance degradation and limiting practical deployment.

Method: Proposes Self Speculative Decoding (SSD) that uses the dLLM itself as both drafter and verifier, generating predictions for multiple positions and verifying them through hierarchical verification trees in a single forward pass.

Result: SSD achieves up to 3.46× speedup on models like LLaDA and Dream while keeping output identical to stepwise decoding.

Conclusion: SSD provides an efficient lossless acceleration method for dLLMs by eliminating model redundancy and memory overhead through self-speculative decoding.

Abstract: Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results of current parallel decoding methods deviate from stepwise decoding, introducing potential performance degradation, which limits their practical deployment. To address this problem, we propose \textbf{S}elf \textbf{S}peculative \textbf{D}ecoding (SSD), a lossless inference acceleration method that leverages the dLLM itself as both speculative decoding drafter and verifier without auxiliary modules. SSD introduces a self-drafting mechanism where the model generates predictions for multiple positions, then verifies them through hierarchical verification trees in a single forward pass. Unlike traditional speculative decoding that requires separate draft models, SSD eliminates model redundancy and memory overhead by exploiting the dLLM’s inherent parallel prediction capability for multiple positions. This self-speculative approach allows the model to progressively verify and accept multiple tokens in a single forward pass. Our experiments demonstrate that SSD achieves up to 3.46$\times$ speedup while keeping the output identical to stepwise decoding on open source models such as LLaDA and Dream. Code will be made publicly available on GitHub.

[53] Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization

Wengao Ye, Yan Liang, Lianlei Shan

Main category: cs.CL

TL;DR: LTPO is a parameter-free framework that optimizes latent thought vectors at test time using policy gradient methods with intrinsic confidence-based rewards, significantly improving reasoning robustness on challenging tasks.

DetailsMotivation: Current latent reasoning approaches in LLMs are brittle on challenging out-of-distribution tasks where robust reasoning is critical, despite being more efficient than explicit CoT reasoning.

Method: LTPO treats intermediate latent thought vectors as dynamic parameters optimized per problem instance using online policy gradient methods guided by intrinsic confidence-based reward signals from the frozen LLM’s output distributions.

Result: LTPO matches or surpasses strong baselines on standard tasks and shows remarkable robustness where others fail, achieving substantial improvements on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy.

Conclusion: LTPO demonstrates unique capability for complex reasoning by optimizing latent thoughts at test time without parameter updates, providing a robust solution for challenging reasoning tasks.

Abstract: Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent “thought” vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM’s own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.

[54] CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling

Zhengyang Tang, Zihan Ye, Chenyu Huang, Xuhan Huang, Chengpeng Li, Sihang Li, Guanhua Chen, Ming Yan, Zizhuo Wang, Hongyuan Zha, Dayiheng Liu, Benyou Wang

Main category: cs.CL

TL;DR: CALM framework uses corrective hints to refine LRMs’ reasoning for optimization modeling, achieving state-of-the-art performance with minimal token modification.

DetailsMotivation: Existing domain adaptation methods fail to exploit modern LRMs' advanced reasoning patterns, leading to limited gains in optimization modeling tasks.

Method: CALM framework with expert interventions identifying reasoning flaws and providing corrective hints, followed by supervised fine-tuning and reinforcement learning.

Result: STORM model achieves 68.9% average accuracy across five optimization benchmarks, matching 671B LRM performance with only 4B parameters.

Conclusion: Dynamic hint-based data synthesis preserves and amplifies LRMs’ native reasoning patterns, offering effective path to expert-level optimization modeling.

Abstract: Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs – In particular, we show that direct fine-tuning on traditional \textit{non-reflective} datasets leads to limited gains. To fully leverage LRMs’ inherent reasoning abilities, we propose \textbf{CALM} (\textit{Corrective Adaptation with Lightweight Modification}), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, but generate high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop \textbf{STORM} (\textit{Smart Thinking Optimization Reasoning Model}), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert-level performance on challenging optimization modeling tasks.

[55] Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment frm Heterogeneous Rewards

Zhuoran Zhuang, Ye Chen, Xia Zeng, Chao Luo, Luhui Liu, Yihan Chen

Main category: cs.CL

TL;DR: REPO is a reinforcement learning framework that combines multiple reward signals to align LLMs for persuasive price negotiation in OTAs, outperforming existing methods in dialogue quality and constraint compliance.

DetailsMotivation: Existing post-training methods (SFT, single-source reward optimization) overfit scripts, miss nuanced persuasive style, and fail to enforce verifiable business constraints needed for effective price negotiation in online travel agencies.

Method: Proposed Reward-Enhanced Policy Optimization (REPO) that combines heterogeneous rewards: preference-trained reward model for human alignment, reward judge for persuasive behavior and SOP compliance, and programmatic reward functions for deterministic checks on numerics, formatting, and guardrails.

Result: REPO achieved average dialogue rating of 4.63 (+1.20 over base, +0.83 over DPO), increased excellent response conversations to 66.67% (+23.34 pp over GRPO), and achieved 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO.

Conclusion: REPO effectively aligns LLMs for persuasive negotiation while enforcing business constraints, demonstrating emergent capabilities like proactive empathy and calibrated tactics that surpass gold annotations.

Abstract: We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training – supervised fine-tuning (SFT) or single-source reward optimization – overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations – approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues – REPO lifts average dialogue rating to 4.63: +1.20 over base, +0.83 over Direct Preference Optimization (DPO); +0.33 over Group Relative Policy Optimization (GRPO), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities – proactive empathy, localized reasoning, calibrated tactics – that surpass gold annotations.

[56] Epistemic Diversity and Knowledge Collapse in Large Language Models

Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Chan Young Park, Isabelle Augenstein

Main category: cs.CL

TL;DR: LLMs generate homogenous texts risking knowledge collapse. This study measures epistemic diversity across 27 LLMs, 155 topics, and 12 countries, finding newer models are more diverse but still less than web searches. Model size reduces diversity while RAG improves it, with cultural context affecting RAG’s effectiveness.

DetailsMotivation: LLMs generate homogenous texts that risk knowledge collapse - shrinking accessible information over time. Existing work is limited to closed-ended setups or fuzzy semantics without examining temporal and cultural trends.

Method: Developed methodology to measure epistemic diversity (variation in real-world claims). Tested 27 LLMs on 155 topics covering 12 countries using 200 prompt variations from real user chats. Compared against web searches and Wikipedia.

Result: Newer models generate more diverse claims but nearly all are less epistemically diverse than basic web search. Model size negatively impacts diversity. RAG positively impacts diversity, but improvement varies by cultural context. Country-specific claims reflect English language more than local languages.

Conclusion: LLMs show concerning epistemic homogenization compared to traditional knowledge sources. Cultural context significantly affects knowledge representation, with English dominance in country-specific claims highlighting gaps in epistemic representation.

Abstract: Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation

[57] Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought

Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Amit Agarwal, Hyunwoo Ko, Chanuk Lim, Srikant Panda, Minhyuk Kim, Nikunj Drolia, Dasol Choi, Kyong-Ha Lee, Youngjae Yu

Main category: cs.CL

TL;DR: The paper introduces Language-Mixed CoT, a reasoning approach that switches between English and target languages to improve reasoning while minimizing translation artifacts. Using Korean as a case study, they create Yi-Sang dataset and train models that achieve state-of-the-art performance.

DetailsMotivation: To bridge the gap in understanding language-specific reasoning, as most distillation works focus on English and little is known about reasoning in other languages.

Method: Develop Language-Mixed CoT reasoning schema, curate Yi-Sang dataset (5.79M Korean prompts with 3.7M reasoning traces), train models across multiple families (4B-35B parameters).

Result: Best model KO-REAson-35B achieves state-of-the-art with highest overall average score (64.0 ± 25), ranking first on 5/9 benchmarks. Smaller models show average improvement of +18.6 points across nine benchmarks.

Conclusion: Language-Mixed CoT is more effective than monolingual CoT and provides cross-lingual and multimodal performance gains. The approach advances research on language-specific reasoning.

Abstract: Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stonger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduct Language-Mixed CoT, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artificats. As a Korean case study, we curate Yi-Sang: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train ninve models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc). Our best model, KO-REAson-35B, achieves state-of-the-art performance, with the highest overall average score (64.0 \pm 25), ranking first on 5/9 benchmarks and second on the remainder. Samller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across teh evaluated nine benchmarks. Ablations show Language-Mixed CoT is more effective than monolingual CoT, also resulting in cross-lingual and mult-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.

[58] LongTail-Swap: benchmarking language models’ abilities on rare words

Robin Algayres, Charles-Éric Saint-James, Mahi Luthra, Jiayi Shen, Dongyan Lin, Youssef Benchekroun, Rashel Moritz, Juan Pino, Emmanuel Dupoux

Main category: cs.CL

TL;DR: The paper introduces LongTail-Swap (LT-Swap), a benchmark for evaluating language models’ ability to learn rare words with minimal exposure, similar to infants. It tests models on semantic and syntactic usage of rare words through acceptable/unacceptable sentence pairs.

DetailsMotivation: Current language model evaluation focuses on common words (head of distribution), but children learn efficiently with little data, especially for rare words. The BabyLM challenge uses metrics that don't adequately measure rare word learning capabilities.

Method: Created LT-Swap benchmark with pretraining corpus-specific test sets containing acceptable vs unacceptable sentence pairs that isolate rare word usage. Evaluated 16 BabyLM models zero-shot by computing average log probabilities for sentence pairs.

Result: Language models perform poorly on rare words. Performance differences across architectures are much more pronounced in the long tail than in common words, revealing which architectures handle rare word generalization better.

Conclusion: LT-Swap provides new insights into language model capabilities for rare word learning, showing that architectural differences matter significantly more for tail distribution performance than for head distribution.

Abstract: Children learn to speak with a low amount of data and can be taught new words on a few-shot basis, making them particularly data-efficient learners. The BabyLM challenge aims at exploring language model (LM) training in the low-data regime but uses metrics that concentrate on the head of the word distribution. Here, we introduce LongTail-Swap (LT-Swap), a benchmark that focuses on the tail of the distribution, i.e., measures the ability of LMs to learn new words with very little exposure, like infants do. LT-Swap is a pretraining corpus-specific test set of acceptable versus unacceptable sentence pairs that isolate semantic and syntactic usage of rare words. Models are evaluated in a zero-shot fashion by computing the average log probabilities over the two members of each pair. We built two such test sets associated with the 10M words and 100M words BabyLM training sets, respectively, and evaluated 16 models from the BabyLM leaderboard. Our results not only highlight the poor performance of language models on rare words but also reveal that performance differences across LM architectures are much more pronounced in the long tail than in the head. This offers new insights into which architectures are better at handling rare word generalization. We’ve also made the code publicly avail

[59] Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy

Karthik Viswanathan, Sang Eon Park

Main category: cs.CL

TL;DR: A cumulant-expansion framework for analyzing how LLMs learn higher-order statistical structure during next-token prediction, revealing layer-wise learning dynamics and domain-specific processing mechanisms.

DetailsMotivation: To quantify how large language models internalize higher-order statistical structure during training and inference, providing mathematical insights into feature-learning dynamics.

Method: Treating softmax entropy as perturbation around center distribution to derive closed-form cumulant observables that isolate higher-order correlations, applied to GPT-2 and Pythia models on Pile-10K prompts.

Result: Structured prompts show rise-and-plateau cumulant profiles, training reveals monotonic increase of cumulants from variance to higher-order structures, and mathematical prompts exhibit distinct signatures from general text.

Conclusion: Cumulant analysis provides a lightweight, mathematically grounded method for probing feature-learning dynamics in high-dimensional neural networks.

Abstract: We introduce a cumulant-expansion framework for quantifying how large language models (LLMs) internalize higher-order statistical structure during next-token prediction. By treating the softmax entropy of each layer’s logit distribution as a perturbation around its “center” distribution, we derive closed-form cumulant observables that isolate successively higher-order correlations. Empirically, we track these cumulants in GPT-2 and Pythia models on Pile-10K prompts. (i) Structured prompts exhibit a characteristic rise-and-plateau profile across layers, whereas token-shuffled prompts remain flat, revealing the dependence of the cumulant profile on meaningful context. (ii) During training, all cumulants increase monotonically before saturating, directly visualizing the model’s progression from capturing variance to learning skew, kurtosis, and higher-order statistical structures. (iii) Mathematical prompts show distinct cumulant signatures compared to general text, quantifying how models employ fundamentally different processing mechanisms for mathematical versus linguistic content. Together, these results establish cumulant analysis as a lightweight, mathematically grounded probe of feature-learning dynamics in high-dimensional neural networks.

[60] SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling

Harshil Vejendla

Main category: cs.CL

TL;DR: SliceMoE improves Mixture-of-Experts by routing token slices instead of entire tokens, achieving better load balancing, specialization, and faster inference while maintaining efficiency.

DetailsMotivation: Token-level routing in MoE assigns entire semantic spectrums to experts, causing capacity bottlenecks, load-balancing issues, and limited specialization.

Method: Partition d-dimensional embeddings into S slices, use lightweight shared router for top-k expert selection per slice, operate experts independently on assigned slices, and reassemble outputs with slice-level capacity loss and cross-slice dropout.

Result: 1.7x faster inference than dense baselines, 12-18% lower perplexity than token-MoE, improved expert balance, and interpretable expertise over syntactic vs semantic subspaces.

Conclusion: SliceMoE effectively addresses MoE limitations through slice-level routing, achieving superior performance, efficiency, and interpretable expert specialization.

Abstract: Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a sparse subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialization. We introduce SliceMoE, an architecture that routes contiguous slices of a token’s hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are reassembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilization is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels. Experiments on WikiText-103 language modeling, WMT En-De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12 to 18 percent lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic subspaces.

[61] PABSA: Hybrid Framework for Persian Aspect-Based Sentiment Analysis

Mehrzad Tareh, Aydin Mohandesi, Ebrahim Ansari

Main category: cs.CL

TL;DR: A hybrid ML-DL approach for Persian aspect-based sentiment analysis using multilingual BERT features and decision trees, achieving 93.34% accuracy and introducing a Persian synonym/entity dictionary for text augmentation.

DetailsMotivation: Persian sentiment analysis faces challenges due to scarce labeled datasets, limited preprocessing tools, and lack of quality embeddings/features for this low-resource language.

Method: Hybrid approach combining machine learning and deep learning, using multilingual BERT polarity scores as features in decision tree classifier, plus Persian synonym/entity dictionary for text augmentation.

Result: Achieved 93.34% accuracy on Pars-ABSA dataset, surpassing existing benchmarks for Persian ABSA.

Conclusion: Hybrid modeling and feature augmentation effectively advance sentiment analysis for low-resource languages like Persian.

Abstract: Sentiment analysis is a key task in Natural Language Processing (NLP), enabling the extraction of meaningful insights from user opinions across various domains. However, performing sentiment analysis in Persian remains challenging due to the scarcity of labeled datasets, limited preprocessing tools, and the lack of high-quality embeddings and feature extraction methods. To address these limitations, we propose a hybrid approach that integrates machine learning (ML) and deep learning (DL) techniques for Persian aspect-based sentiment analysis (ABSA). In particular, we utilize polarity scores from multilingual BERT as additional features and incorporate them into a decision tree classifier, achieving an accuracy of 93.34%-surpassing existing benchmarks on the Pars-ABSA dataset. Additionally, we introduce a Persian synonym and entity dictionary, a novel linguistic resource that supports text augmentation through synonym and named entity replacement. Our results demonstrate the effectiveness of hybrid modeling and feature augmentation in advancing sentiment analysis for low-resource languages such as Persian.

[62] Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness

Lingnan Xu, Chong Feng, Kaiyuan Zhang, Liu Zhengyong, Wenqiang Xu, Fanqing Meng

Main category: cs.CL

TL;DR: RDR2 is a novel RAG framework that incorporates document structure information through trainable document routing, achieving state-of-the-art performance by better handling complex multi-document scenarios.

DetailsMotivation: Existing RAG approaches treat retrieved passages as isolated chunks, ignoring valuable document structure that is crucial for organization and knowledge synthesis.

Method: Proposes Retrieve-DocumentRoute-Read (RDR2) framework with LLM-based router that dynamically navigates document structure trees, jointly evaluating content relevance and hierarchical relationships to assemble optimal evidence.

Result: Achieves state-of-the-art performance on five challenging datasets, demonstrating significant enhancement in RAG systems’ ability to acquire and utilize knowledge.

Conclusion: Explicit structural awareness significantly enhances RAG systems, particularly in complex scenarios requiring multi-document synthesis.

Abstract: While large language models (LLMs) demonstrate impressive capabilities, their reliance on parametric knowledge often leads to factual inaccuracies. Retrieval-Augmented Generation (RAG) mitigates this by leveraging external documents, yet existing approaches treat retrieved passages as isolated chunks, ignoring valuable structure that is crucial for document organization. Motivated by this gap, we propose Retrieve-DocumentRoute-Read (RDR2), a novel framework that explicitly incorporates structural information throughout the RAG process. RDR2 employs an LLM-based router to dynamically navigate document structure trees, jointly evaluating content relevance and hierarchical relationships to assemble optimal evidence. Our key innovation lies in formulating document routing as a trainable task, with automatic action curation and structure-aware passage selection inspired by human reading strategies. Through comprehensive evaluation on five challenging datasets, RDR2 achieves state-of-the-art performance, demonstrating that explicit structural awareness significantly enhances RAG systems’ ability to acquire and utilize knowledge, particularly in complex scenarios requiring multi-document synthesis.

[63] Measuring Language Model Hallucinations Through Distributional Correctness

Thomas F Burns

Main category: cs.CL

TL;DR: The paper introduces Distributional Correctness Score (DCS), a novel evaluation metric that considers a model’s entire probability distribution over answer choices, distinguishing between harmful overconfidence in wrong answers and uncertainty expressed through abstention.

DetailsMotivation: Current evaluation paradigms for language models focus on scoring single responses through accuracy metrics, failing to capture the full richness of a model's belief state and ignoring crucial distinctions in how models distribute uncertainty.

Method: Introduces DCS metric that analyzes a model’s probability distribution over answer choices, distinguishing between hedging toward incorrect answers versus hedging toward “I don’t know” responses. Adapts 12 existing evaluation benchmarks to DCS variants and tests on six language models.

Result: For half of the tested benchmarks, scores are negative across all tested models, indicating significant tendencies towards hallucination. DCS provides scores in an interpretable default range and offers more nuanced evaluation.

Conclusion: DCS offers a more aligned evaluation paradigm that incentivizes models to express genuine uncertainty rather than guessing, addressing the problem of models being optimized for binary scoring schemes that reward any answer over abstention.

Abstract: Common evaluation paradigms for language models focus on scoring single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model’s belief state. Recent work illustrates that language models hallucinate in-part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, they ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers versus hedging toward “I don’t know” responses. A novel evaluation metric, the Distributional Correctness Score (DCS), is introduced to solve this problem, i.e., of not considering a model’s entire probability distribution over answer choices. DCS naturally distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, DCS is demonstrated to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guessing. Adapting 12 existing evaluation benchmarks to DCS’s variants and measuring performance on six language models reveals that for half of the tested benchmarks scores are negative across all tested models, indicating significant tendencies towards hallucination.

[64] Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang

Main category: cs.CL

TL;DR: Current safety-aligned LLMs suffer from consequence-blindness - they either get easily jailbroken or over-refuse harmless inputs due to weak reasoning about action-outcome links and over-reliance on surface signals.

DetailsMotivation: To address the dual failure modes of jailbreaking and over-refusal in safety-aligned LLMs, which stem from consequence-blindness - the inability to properly reason about links between actions and outcomes.

Method: Built CB-Bench benchmark covering four risk scenarios with matched/mismatched semantic and outcome risks, and created CS-Chain-4k dataset for consequence-reasoning fine-tuning.

Result: Models fine-tuned on CS-Chain-4k show improved resistance to semantic-camouflage jailbreaks, reduced over-refusal on harmless inputs, while maintaining utility and generalization on other benchmarks.

Conclusion: Consequence-blindness is widespread in current LLMs, consequence-aware reasoning should be a core alignment goal, and the proposed methods provide a more practical evaluation path.

Abstract: Safety-aligned Large Language Models (LLMs) still show two dominant failure modes: they are easily jailbroken, or they over-refuse harmless inputs that contain sensitive surface signals. We trace both to a common cause: current models reason weakly about links between actions and outcomes and over-rely on surface-form signals, lexical or stylistic cues that do not encode consequences. We define this failure mode as Consequence-blindness. To study consequence-blindness, we build a benchmark named CB-Bench covering four risk scenarios that vary whether semantic risk aligns with outcome risk, enabling evaluation under both matched and mismatched conditions which are often ignored by existing safety benchmarks. Mainstream models consistently fail to separate these risks and exhibit consequence-blindness, indicating that consequence-blindness is widespread and systematic. To mitigate consequence-blindness, we introduce CS-Chain-4k, a consequence-reasoning dataset for safety alignment. Models fine-tuned on CS-Chain-4k show clear gains against semantic-camouflage jailbreaks and reduce over-refusal on harmless inputs, while maintaining utility and generalization on other benchmarks. These results clarify the limits of current alignment, establish consequence-aware reasoning as a core alignment goal and provide a more practical and reproducible evaluation path.

[65] Evaluation of Clinical Trials Reporting Quality using Large Language Models

Mathieu Laï-king, Patrick Paroubek

Main category: cs.CL

TL;DR: Large language models can assess clinical trial reporting quality using CONSORT standards with 85% accuracy, with Chain-of-thought prompting providing valuable reasoning insights.

DetailsMotivation: Reporting quality in clinical trial research articles impacts clinical decisions, and there's a need to test if large language models can effectively assess this quality using CONSORT standards.

Method: Created CONSORT-QA evaluation corpus from two studies on abstract reporting quality with CONSORT-abstract standards, then evaluated various large language models (general and biomedical domain) using different prompting methods including Chain-of-thought.

Result: The best combination of model and prompting method achieved 85% accuracy in assessing CONSORT criteria. Chain-of-thought provided valuable information about the model’s reasoning process.

Conclusion: Large language models show promise for assessing clinical trial reporting quality, with Chain-of-thought prompting enhancing both performance and interpretability of the assessment process.

Abstract: Reporting quality is an important topic in clinical trial research articles, as it can impact clinical decisions. In this article, we test the ability of large language models to assess the reporting quality of this type of article using the Consolidated Standards of Reporting Trials (CONSORT). We create CONSORT-QA, an evaluation corpus from two studies on abstract reporting quality with CONSORT-abstract standards. We then evaluate the ability of different large generative language models (from the general domain or adapted to the biomedical domain) to correctly assess CONSORT criteria with different known prompting methods, including Chain-of-thought. Our best combination of model and prompting method achieves 85% accuracy. Using Chain-of-thought adds valuable information on the model’s reasoning for completing the task.

[66] Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor

Main category: cs.CL

TL;DR: Inoculation prompting modifies finetuning data by prepending instructions that deliberately elicit undesirable traits, reducing their expression at test time without the instruction.

DetailsMotivation: Language model finetuning often results in learning undesirable traits alongside desired ones, creating a need for selective learning methods.

Method: Proposed inoculation prompting: prepend short system-prompt instructions to finetuning data that deliberately elicit undesirable traits, then evaluate without the instruction.

Result: Inoculated models show much lower expression of undesirable traits across multiple settings: reducing emergent misalignment, defending against backdoor injections, and mitigating trait transmission via subliminal learning.

Conclusion: Inoculation is an effective technique for selective learning that reduces optimization pressure to globally update models, contributing to better understanding of how language models generalize.

Abstract: Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.’’) teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.

[67] Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models

Anindya Sundar Das, Kangjie Chen, Monowar Bhuyan

Main category: cs.CL

TL;DR: The paper investigates backdoor attacks in pre-trained language models and proposes an inference-time defense using attention and gradient anomaly scores to detect and mitigate trigger-based attacks.

DetailsMotivation: Pre-trained language models are vulnerable to backdoor attacks where triggers embedded in training data can cause targeted misclassifications when activated, posing security risks.

Method: Proposes an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information to detect poisoned inputs and trigger patterns.

Result: Extensive experiments show the method significantly reduces attack success rates across diverse backdoor attack scenarios in text classification tasks compared to existing baselines.

Conclusion: The proposed attention and gradient-based defense effectively mitigates backdoor attacks while providing interpretable trigger localization, enhancing model robustness against security threats.

Abstract: Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage, but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on the consistent shift in attention and gradient attribution when processing poisoned inputs; where the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.

[68] Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards

Faisal Hamman, Chenyang Zhu, Anoop Kumar, Xujun Peng, Sanghamitra Dutta, Daben Liu, Alfy Samuel

Main category: cs.CL

TL;DR: The paper addresses inconsistency issues in RAG systems by proposing a framework to evaluate and improve information consistency across semantically equivalent queries using a novel RL approach called PS-GRPO.

DetailsMotivation: RAG systems deployed in high-stakes domains often produce inconsistent outputs for semantically equivalent queries, undermining trust and reliability. This work focuses on ensuring information consistency across such queries.

Method: Proposes PS-GRPO (Paraphrased Set Group Relative Policy Optimization), an RL approach that uses multiple rollouts across paraphrased sets with group similarity rewards. Also introduces a scalable approximation for efficient training.

Result: Con-RAG significantly improves both consistency and accuracy across short-form, multi-hop, and long-form QA benchmarks, even without explicit ground-truth supervision.

Conclusion: The work provides practical solutions for evaluating and building reliable RAG systems suitable for safety-critical deployments by addressing consistency issues through principled evaluation and training methods.

Abstract: RAG systems are increasingly deployed in high-stakes domains where users expect outputs to be consistent across semantically equivalent queries. However, existing systems often exhibit significant inconsistencies due to variability in both the retriever and generator (LLM), undermining trust and reliability. In this work, we focus on information consistency, i.e., the requirement that outputs convey the same core content across semantically equivalent inputs. We introduce a principled evaluation framework that decomposes RAG consistency into retriever-level, generator-level, and end-to-end components, helping identify inconsistency sources. To improve consistency, we propose Paraphrased Set Group Relative Policy Optimization (PS-GRPO), an RL approach that leverages multiple rollouts across paraphrased set to assign group similarity rewards. We leverage PS-GRPO to achieve Information Consistent RAG (Con-RAG), training the generator to produce consistent outputs across paraphrased queries and remain robust to retrieval-induced variability. Because exact reward computation over paraphrase sets is computationally expensive, we also introduce a scalable approximation method that retains effectiveness while enabling efficient, large-scale training. Empirical evaluations across short-form, multi-hop, and long-form QA benchmarks demonstrate that Con-RAG significantly improves both consistency and accuracy over strong baselines, even in the absence of explicit ground-truth supervision. Our work provides practical solutions for evaluating and building reliable RAG systems for safety-critical deployments.

[69] Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation

Ankit Vadehra, Bill Johnson, Gene Saunders, Pascal Poupart

Main category: cs.CL

TL;DR: PEET is a human-focused evaluation metric that ranks GEC tools by estimating post-editing time savings, using a new large-scale dataset with time annotations.

DetailsMotivation: To quantify how much effort GEC tools can save users during text editing by measuring actual post-editing time rather than just technical accuracy.

Method: Created first large-scale dataset of post-editing time annotations for BEA19 and CoNLL14 GEC test datasets, then developed PEET scorer to estimate time-to-correct.

Result: GEC tools significantly reduce editing time; determining whether correction is needed and edits like paraphrasing/punctuation have biggest time impact. PEET correlates well with human rankings.

Conclusion: PEET provides a human-centric evaluation approach for GEC tool usability that better reflects real-world editing efficiency than technical metrics alone.

Abstract: Text editing can involve several iterations of revision. Incorporating an efficient Grammar Error Correction (GEC) tool in the initial correction round can significantly impact further human editing effort and final text quality. This raises an interesting question to quantify GEC Tool usability: How much effort can the GEC Tool save users? We present the first large-scale dataset of post-editing (PE) time annotations and corrections for two English GEC test datasets (BEA19 and CoNLL14). We introduce Post-Editing Effort in Time (PEET) for GEC Tools as a human-focused evaluation scorer to rank any GEC Tool by estimating PE time-to-correct. Using our dataset, we quantify the amount of time saved by GEC Tools in text editing. Analyzing the edit type indicated that determining whether a sentence needs correction and edits like paraphrasing and punctuation changes had the greatest impact on PE time. Finally, comparison with human rankings shows that PEET correlates well with technical effort judgment, providing a new human-centric direction for evaluating GEC tool usability. We release our dataset and code at: https://github.com/ankitvad/PEET_Scorer.

[70] SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan, René Vidal

Main category: cs.CL

TL;DR: SECA proposes realistic adversarial attacks for eliciting LLM hallucinations through semantically equivalent and coherent prompt modifications, achieving higher attack success rates than existing methods.

DetailsMotivation: Current adversarial attacks for LLM hallucination elicitation produce unrealistic prompts that don't reflect real-world scenarios, limiting practical insights into how hallucinations occur.

Method: Formulates realistic attacks as constrained optimization under semantic equivalence and coherence constraints, using a constraint-preserving zeroth-order method to search for adversarial prompts.

Result: SECA achieves higher attack success rates on open-ended multiple-choice question answering tasks while maintaining almost no constraint violations compared to existing methods.

Conclusion: LLMs are sensitive to realistic and plausible prompt variations, highlighting the need for more robust models that can withstand semantically equivalent attacks.

Abstract: Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no constraint violations compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.

[71] Large Language Models Preserve Semantic Isotopies in Story Continuations

Marc Cavazza

Main category: cs.CL

TL;DR: LLMs preserve semantic isotopies in generated text continuations, maintaining structural and semantic coherence across multiple properties.

DetailsMotivation: To investigate whether Large Language Models preserve semantic isotopies (recurring semantic elements) in generated text, extending previous research on distributional and structural semantics.

Method: Used 10,000 ROCStories prompts completed by five LLMs, validated GPT-4o’s isotopy extraction capability on a linguistic benchmark, then analyzed structural (coverage, density, spread) and semantic properties of isotopies in generated stories.

Result: LLM completion within a given token horizon preserves semantic isotopies across multiple structural and semantic properties.

Conclusion: Large Language Models effectively maintain semantic isotopies in text generation, demonstrating their ability to preserve semantic coherence in extended text continuations.

Abstract: In this work, we explore the relevance of textual semantics to Large Language Models (LLMs), extending previous insights into the connection between distributional semantics and structural semantics. We investigate whether LLM-generated texts preserve semantic isotopies. We design a story continuation experiment using 10,000 ROCStories prompts completed by five LLMs. We first validate GPT-4o’s ability to extract isotopies from a linguistic benchmark, then apply it to the generated stories. We then analyze structural (coverage, density, spread) and semantic properties of isotopies to assess how they are affected by completion. Results show that LLM completion within a given token horizon preserves semantic isotopies across multiple properties.

[72] Good Intentions Beyond ACL: Who Does NLP for Social Good, and Where?

Grace LeFevre, Qingcheng Zeng, Adam Leif, Jason Jewell, Denis Peskoff, Rob Voigt

Main category: cs.CL

TL;DR: This study maps the landscape of NLP for Social Good (NLP4SG) and reveals that ACL authors are more likely to publish NLP4SG work outside ACL venues, and most NLP4SG publications come from non-ACL authors in non-ACL venues.

DetailsMotivation: To understand the distribution and patterns of NLP for Social Good research across different author communities and publication venues, particularly examining the role of ACL community versus broader research landscape.

Method: Author- and venue-level analysis of NLP4SG publications, quantifying proportions of work addressing social good concerns within/beyond ACL community, by both core ACL contributors and non-ACL authors.

Result: Two key findings: 1) ACL authors are dramatically more likely to publish NLP4SG work in venues outside ACL; 2) Majority of NLP4SG publications are by non-ACL authors in non-ACL venues.

Conclusion: The findings have important implications for agenda-setting considerations regarding NLP4SG within the ACL community, suggesting the need to examine publication patterns and community engagement strategies.

Abstract: The social impact of Natural Language Processing (NLP) is increasingly important, with a rising community focus on initiatives related to NLP for Social Good (NLP4SG). Indeed, in recent years, almost 20% of all papers in the ACL Anthology address topics related to social good as defined by the UN Sustainable Development Goals (Adauto et al., 2023). In this study, we take an author- and venue-level perspective to map the landscape of NLP4SG, quantifying the proportion of work addressing social good concerns both within and beyond the ACL community, by both core ACL contributors and non-ACL authors. With this approach we discover two surprising facts about the landscape of NLP4SG. First, ACL authors are dramatically more likely to do work addressing social good concerns when publishing in venues outside of ACL. Second, the vast majority of publications using NLP techniques to address concerns of social good are done by non-ACL authors in venues outside of ACL. We discuss the implications of these findings on agenda-setting considerations for the ACL community related to NLP4SG.

[73] On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs

Lucie Kunitomo-Jacquin, Edison Marrese-Taylor, Ken Fukuda

Main category: cs.CL

TL;DR: The paper argues that accounting for the probability of unobserved sequences is crucial for improving LLM uncertainty quantification methods based on entropy estimation.

DetailsMotivation: Quantifying uncertainty in LLMs is important for safety-critical applications to detect hallucinations. Current entropy-based methods rely on observed sequences but neglect unobserved ones.

Method: The authors advocate for integrating the probability of unobserved sequences into existing entropy-based uncertainty quantification methods for LLMs.

Result: Experimental results show that considering unobserved sequences significantly enhances LLM uncertainty quantification performance.

Conclusion: Future research should incorporate the probability of unobserved sequences to improve LLM uncertainty quantification methods.

Abstract: Quantifying uncertainty in large language models (LLMs) is important for safety-critical applications because it helps spot incorrect answers, known as hallucinations. One major trend of uncertainty quantification methods is based on estimating the entropy of the distribution of the LLM’s potential output sequences. This estimation is based on a set of output sequences and associated probabilities obtained by querying the LLM several times. In this paper, we advocate and experimentally show that the probability of unobserved sequences plays a crucial role, and we recommend future research to integrate it to enhance such LLM uncertainty quantification methods.

[74] Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners

Xiangchi Yuan, Xiang Chen, Tong Yu, Dachuan Shi, Can Jin, Wenke Lee, Saayan Mitra

Main category: cs.CL

TL;DR: A plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT, achieving state-of-the-art reasoning performance with significantly reduced data requirements.

DetailsMotivation: RL algorithms struggle to expand reasoning boundaries as they learn only from their own trajectories, while SFT requires large-scale data and risks overfitting. Combining SFT and RL faces challenges of data inefficiency, algorithm-specific designs, and catastrophic forgetting.

Method: Dynamically integrates SFT into RL by selecting challenging examples for SFT, uses high-entropy tokens for loss calculation to mitigate catastrophic forgetting, and freezes parameters critical for RL.

Result: Achieves state-of-the-art reasoning performance using only 1.5% of SFT data and 20.4% of RL data compared to prior state-of-the-art methods.

Conclusion: Provides an efficient and plug-and-play solution for combining SFT and RL in reasoning post-training, effectively addressing data inefficiency and catastrophic forgetting challenges.

Abstract: Large Language Models (LLMs) show strong reasoning abilities, often amplified by Chain-of-Thought (CoT) prompting and reinforcement learning (RL). Although RL algorithms can substantially improve reasoning, they struggle to expand reasoning boundaries because they learn from their own reasoning trajectories rather than acquiring external knowledge. Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT. This approach reduces SFT data requirements and remains agnostic to the choice of RL or SFT algorithm. To mitigate catastrophic forgetting of RL-acquired skills during SFT, we select high-entropy tokens for loss calculation and freeze parameters identified as critical for RL. Our method achieves state-of-the-art (SoTA) reasoning performance using only 1.5% of the SFT data and 20.4% of the RL data used by prior SoTA, providing an efficient and plug-and-play solution for combining SFT and RL in reasoning post-training.

[75] Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge

Main category: cs.CL

TL;DR: CCA and CCGQA are novel attention methods that compress queries, keys, and values into a shared latent space, dramatically reducing parameters, KV-cache, and FLOPs while maintaining performance.

DetailsMotivation: Multi-headed Attention's quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve, with prior methods only addressing cache size but not compute costs.

Method: Compressed Convolutional Attention (CCA) down-projects queries, keys, and values and performs attention in a shared latent space. Combined with head-sharing to form CCGQA for further optimization.

Result: CCGQA outperforms GQA and MLA at equal KV-cache compression, achieves 8x KV-cache compression with no performance drop, reduces prefill latency by 1.7x and backward by 1.3x on H100 GPUs.

Conclusion: CCA and CCGQA provide superior compute-bandwidth Pareto frontier optimization, enabling substantial reductions in training and inference costs while maintaining model quality across dense and MoE architectures.

Abstract: Multi-headed Attention’s (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache, speeding decode, but leave compute, which determines prefill and training speed, largely unchanged. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Because CCA is orthogonal to head-sharing, we combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so that users can tune compression toward either FLOP or memory limits without sacrificing quality. Experiments show that CCGQA consistently outperforms both GQA and MLA at equal KV-cache compression on dense and MoE models. Additionally, we show that CCGQA outperforms all other attention methods on MoE models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache compression with no drop in performance compared to standard MHA. CCA and CCGQA also dramatically reduce the FLOP cost of attention which leads to substantially faster training and prefill than existing methods. On H100 GPUs, our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA, and accelerates backward by about 1.3x.

[76] Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness

Amin Banayeeanzade, Ala N. Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, Sai Praneeth Karimireddy

Main category: cs.CL

TL;DR: PsySET is a psychological benchmark for evaluating LLM steering effectiveness and trustworthiness across emotion and personality domains, revealing that prompting works well but has limited intensity control, while vector injections offer finer control with slight quality reduction.

DetailsMotivation: To enable rich, human-centered interactions in socially interactive settings by controlling LLMs' emotional states and personality traits, requiring evaluation of steering effectiveness and trustworthiness.

Method: Developed PsySET benchmark spanning four LLM models with various steering strategies including prompting, fine-tuning, and representation engineering, assessing safety, truthfulness, fairness, and ethics.

Result: Prompting is consistently effective but limited in intensity control; vector injections achieve finer controllability with slight output quality reduction. Positive emotions like joy can degrade robustness to adversarial factuality, lower privacy awareness, and increase preferential bias.

Conclusion: Establishes the first holistic evaluation of emotion and personality steering, offering insights into interpretability and reliability for socially interactive applications, highlighting idiosyncratic effects and behavioral shifts.

Abstract: The ability to control LLMs’ emulated emotional states and personality traits is essential for enabling rich, human-centered interactions in socially interactive settings. We introduce PsySET, a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains. Our study spans four models from different LLM families paired with various steering strategies, including prompting, fine-tuning, and representation engineering. Our results indicate that prompting is consistently effective but limited in intensity control, whereas vector injections achieve finer controllability while slightly reducing output quality. Moreover, we explore the trustworthiness of steered LLMs by assessing safety, truthfulness, fairness, and ethics, highlighting potential side effects and behavioral shifts. Notably, we observe idiosyncratic effects; for instance, even a positive emotion like joy can degrade robustness to adversarial factuality, lower privacy awareness, and increase preferential bias. Meanwhile, anger predictably elevates toxicity yet strengthens leakage resistance. Our framework establishes the first holistic evaluation of emotion and personality steering, offering insights into its interpretability and reliability for socially interactive applications.

[77] GenQuest: An LLM-based Text Adventure Game for Language Learners

Qiao Wang, Adnan Labib, Robert Swier, Michael Hofmeyr, Zheng Yuan

Main category: cs.CL

TL;DR: GenQuest is an AI-powered text adventure game that uses LLMs to help EFL learners improve English through interactive storytelling with adaptive content and vocabulary support.

DetailsMotivation: To create an engaging language learning tool that combines immersive storytelling with personalized language instruction for EFL learners.

Method: Uses LLMs to generate dynamic choose-your-own-adventure narratives with branching decision points, proficiency-level content adaptation, and in-context vocabulary explanations.

Result: Pilot study with Chinese university students showed promising vocabulary gains and positive user feedback, though participants suggested improvements to narrative length/quality and requested multi-modal content.

Conclusion: GenQuest demonstrates the potential of LLM-driven interactive storytelling for language learning, with future improvements needed in narrative design and multi-modal integration.

Abstract: GenQuest is a generative text adventure game that leverages Large Language Models (LLMs) to facilitate second language learning through immersive, interactive storytelling. The system engages English as a Foreign Language (EFL) learners in a collaborative “choose-your-own-adventure” style narrative, dynamically generated in response to learner choices. Game mechanics such as branching decision points and story milestones are incorporated to maintain narrative coherence while allowing learner-driven plot development. Key pedagogical features include content generation tailored to each learner’s proficiency level, and a vocabulary assistant that provides in-context explanations of learner-queried text strings, ranging from words and phrases to sentences. Findings from a pilot study with university EFL students in China indicate promising vocabulary gains and positive user perceptions. Also discussed are suggestions from participants regarding the narrative length and quality, and the request for multi-modal content such as illustrations.

[78] GRACE: Generative Representation Learning via Contrastive Policy Optimization

Jiashuo Sun, Shixuan Liu, Zhaochen Su, Xianrui Zhong, Pengcheng Jiang, Bowen Jin, Peiran Li, Weijia Shi, Jiawei Han

Main category: cs.CL

TL;DR: GRACE is a novel framework that treats contrastive signals as rewards to guide a generative policy, enabling LLMs to produce interpretable rationales and high-quality embeddings through policy gradient optimization.

DetailsMotivation: Current methods for training LLMs as text encoders discard generative and reasoning capabilities in favor of static embeddings, treating models as black boxes without transparency.

Method: GRACE uses policy gradient optimization with a multi-component reward function that maximizes similarity between query positive pairs and minimizes similarity with negatives. The LLM produces human-interpretable rationales that are encoded into embeddings via mean pooling.

Result: On MTEB benchmark, GRACE achieves broad cross-category gains: supervised setting improves overall score by 11.5% over base models, and unsupervised variant adds 6.9%, while preserving general capabilities.

Conclusion: GRACE unifies representation learning with generation to produce stronger embeddings and transparent rationales, transforming LLMs from opaque encoders into interpretable agents with inspectable reasoning processes.

Abstract: Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales–structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On MTEB benchmark, GRACE yields broad cross category gains: averaged over four backbones, the supervised setting improves overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data and code are available at https://github.com/GasolSun36/GRACE.

[79] Fine-grained auxiliary learning for real-world product recommendation

Mario Almagro, Diego Ortego, David Jimenez

Main category: cs.CL

TL;DR: ALC is an auxiliary learning strategy that improves recommendation coverage by learning fine-grained embeddings using hardest negatives in training batches.

DetailsMotivation: Real-world production systems require high coverage (proportion of automated recommendations), but existing models often overlook this requirement when deployed.

Method: Proposes ALC with two training objectives that leverage hardest negatives in batches to create discriminative training signals between positive and negative items.

Result: Validated on two product recommendation datasets (LF-AmazonTitles-131K and proprietary Tech and Durables) using three extreme multi-label classification approaches, achieving state-of-the-art coverage rates when combined with threshold-consistent margin loss.

Conclusion: ALC effectively boosts coverage in product recommendation systems by learning fine-grained embeddings through auxiliary learning objectives.

Abstract: Product recommendation is the task of recovering the closest items to a given query within a large product corpora. Generally, one can determine if top-ranked products are related to the query by applying a similarity threshold; exceeding it deems the product relevant, otherwise manual revision is required. Despite being a well-known problem, the integration of these models in real-world systems is often overlooked. In particular, production systems have strong coverage requirements, i.e., a high proportion of recommendations must be automated. In this paper we propose ALC , an Auxiliary Learning strategy that boosts Coverage through learning fine-grained embeddings. Concretely, we introduce two training objectives that leverage the hardest negatives in the batch to build discriminative training signals between positives and negatives. We validate ALC using three extreme multi-label classification approaches in two product recommendation datasets; LF-AmazonTitles-131K and Tech and Durables (proprietary), demonstrating state-of-the-art coverage rates when combined with a recent threshold-consistent margin loss.

[80] Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference

Dang Anh, Rick Nouwen, Massimo Poesio

Main category: cs.CL

TL;DR: LLMs show some awareness of plural reference ambiguity but struggle with human-like interpretation and ambiguity detection without explicit instruction.

DetailsMotivation: To study how LLMs represent and interpret plural reference in ambiguous and unambiguous contexts, comparing their performance to human preferences.

Method: Designed experiments using next-token prediction tasks for pronoun production, pronoun interpretation, and ambiguity detection with different prompting strategies.

Result: LLMs are sometimes aware of possible referents of ambiguous pronouns but don’t follow human reference preferences, especially when interpretations aren’t explicitly mentioned. They struggle to identify ambiguity without direct instruction.

Conclusion: LLMs exhibit inconsistencies across different experiment types and have limitations in human-like plural reference processing, particularly in ambiguity detection and interpretation alignment.

Abstract: Our goal is to study how LLMs represent and interpret plural reference in ambiguous and unambiguous contexts. We ask the following research questions: (1) Do LLMs exhibit human-like preferences in representing plural reference? (2) Are LLMs able to detect ambiguity in plural anaphoric expressions and identify possible referents? To address these questions, we design a set of experiments, examining pronoun production using next-token prediction tasks, pronoun interpretation, and ambiguity detection using different prompting strategies. We then assess how comparable LLMs are to humans in formulating and interpreting plural reference. We find that LLMs are sometimes aware of possible referents of ambiguous pronouns. However, they do not always follow human reference when choosing between interpretations, especially when the possible interpretation is not explicitly mentioned. In addition, they struggle to identify ambiguity without direct instruction. Our findings also reveal inconsistencies in the results across different types of experiments.

[81] FedSRD: Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning

Guochen Yan, Luyuan Xie, Qingni Shen, Yuejian Fang, Zhonghai Wu

Main category: cs.CL

TL;DR: FedSRD is a communication-efficient federated learning framework that addresses the bottleneck of LoRA parameter transmission in heterogeneous networks through sparsification, reconstruction, and decomposition techniques.

DetailsMotivation: Current LLM training faces data scarcity in specialized domains, and federated learning with LoRA encounters high communication overhead and parameter conflicts in decentralized settings.

Method: FedSRD uses importance-aware sparsification to reduce uploaded LoRA parameters, server-side reconstruction and aggregation in full-rank space to mitigate conflicts, and decomposition into sparse low-rank format for efficient broadcasting. FedSRD-e is a computational-efficient variant.

Result: Experimental results on 10 benchmarks show up to 90% reduction in communication costs while improving model performance on heterogeneous client data.

Conclusion: FedSRD provides an effective solution for communication-efficient federated fine-tuning of LLMs, enabling sustainable AI development on decentralized Web infrastructure.

Abstract: The current paradigm of training large language models (LLMs) on publicly available Web data is becoming unsustainable, with high-quality data sources in specialized domains nearing exhaustion. Federated Learning (FL) emerges as a practical solution for the next generation of AI on a decentralized Web, enabling privacy-preserving collaborative fine-tuning by leveraging private data distributed across a global client base. While Low-Rank Adaptation (LoRA) is the standard for efficient fine-tuning, its application in federated settings presents a critical challenge: communication overhead remains a significant bottleneck across the Web’s heterogeneous network conditions. The structural redundancy within LoRA parameters not only incurs a heavy communication burden but also introduces conflicts when aggregating client updates. To address this, we propose FedSRD, a Sparsify-Reconstruct-Decompose framework designed for communication-efficient FL. We first introduce an importance-aware sparsification method that preserves the structural integrity of LoRA updates to reduce the uploaded parameter count. The server then reconstructs and aggregates these updates in a full-rank space to mitigate conflicts. Finally, it decomposes the global update into a sparse low-rank format for broadcast, ensuring a symmetrically efficient cycle. We also propose an efficient variant, FedSRD-e, to reduce computational overhead. Experimental results on 10 benchmarks demonstrate that our framework significantly reduces communication costs by up to 90% while even improving model performance on heterogeneous client data.

[82] Contrastive Learning Using Graph Embeddings for Domain Adaptation of Language Models in the Process Industry

Anastasia Zhukova, Jonas Lührs, Christian E. Matt, Bela Gipp

Main category: cs.CL

TL;DR: SciNCL, a graph-aware contrastive learning method, is adapted for process industry text logs structured as sparse knowledge graphs, achieving significant performance gains over mE5-large while being much smaller.

DetailsMotivation: To enhance language models for process industry by incorporating domain-specific knowledge from sparse knowledge graphs found in operational text logs.

Method: Applied SciNCL’s neighborhood contrastive learning methodology to process industry domain, using triplets derived from graph embeddings for fine-tuning language models.

Result: Models fine-tuned with graph embedding triplets outperformed mE5-large by 9.8-14.3% on PITEB benchmark while being 3-5 times smaller in size.

Conclusion: Graph-aware contrastive learning effectively improves language model performance in process industry applications, enabling smaller models to outperform larger state-of-the-art encoders.

Abstract: Recent trends in NLP utilize knowledge graphs (KGs) to enhance pretrained language models by incorporating additional knowledge from the graph structures to learn domain-specific terminology or relationships between documents that might otherwise be overlooked. This paper explores how SciNCL, a graph-aware neighborhood contrastive learning methodology originally designed for scientific publications, can be applied to the process industry domain, where text logs contain crucial information about daily operations and are often structured as sparse KGs. Our experiments demonstrate that language models fine-tuned with triplets derived from GE outperform a state-of-the-art mE5-large text encoder by 9.8-14.3% (5.4-8.0p) on the proprietary process industry text embedding benchmark (PITEB) while being 3-5 times smaller in size.

[83] Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study

Ayan Majumdar, Feihao Chen, Jinghui Li, Xiaozhen Wang

Main category: cs.CL

TL;DR: This paper presents a comprehensive evaluation of LLMs for detecting demographic-targeted social biases in text corpora, finding that fine-tuned smaller models show promise but gaps remain in detecting multi-demographic biases.

DetailsMotivation: Large-scale web-scraped text corpora contain harmful demographic-targeted social biases, creating regulatory need for scalable bias-detection methods. Prior work is narrow in scope, focusing on single content types and limited demographics.

Method: Developed a comprehensive evaluation framework for English texts using multi-label bias detection with demographic-focused taxonomy. Evaluated models across scales and techniques including prompting, in-context learning, and fine-tuning using twelve diverse datasets.

Result: Fine-tuned smaller models show promise for scalable bias detection, but analyses reveal persistent gaps across demographic axes and multi-demographic targeted biases.

Conclusion: There is a need for more effective and scalable auditing frameworks to address the limitations in detecting multi-demographic biases across different demographic axes.

Abstract: Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and developing scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope. They typically focus on a single content type (e.g., hate speech), cover limited demographic axes, overlook biases affecting multiple demographics simultaneously, and analyze limited techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this study, we present a comprehensive evaluation framework aimed at English texts to assess the ability of LLMs in detecting demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task using a demographic-focused taxonomy. We then conduct a systematic evaluation with models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and multi-demographic targeted biases, underscoring the need for more effective and scalable auditing frameworks.

[84] FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method

Yuheng Li, Jiechao Gao, Wei Han, Wenwen Ouyang, Wei Zhu, Hui Yi Leong

Main category: cs.CL

TL;DR: PI-LoRA is a novel low-rank adaptation method that automatically extracts medical decision trees from clinical texts by integrating gradient path information for better rank allocation, achieving state-of-the-art performance with reduced complexity.

DetailsMotivation: Current medical decision tree construction methods rely heavily on manual annotation, which is time-consuming and laborious. There is a need for automated methods to extract MDTs from clinical guidelines and textbooks.

Method: Proposed PI-LoRA (Path-Integrated LoRA), a low-rank adaptation method that integrates gradient path information to capture synergistic effects between modules, enabling effective rank allocation and pruning of less important modules.

Result: Extensive experiments show PI-LoRA significantly outperforms existing parameter-efficient fine-tuning approaches for Text2MDT task, achieving better accuracy with substantially reduced model complexity.

Conclusion: PI-LoRA achieves state-of-the-art results while maintaining a lightweight architecture, making it particularly suitable for clinical decision support systems with limited computational resources.

Abstract: Knowledge of the medical decision process, which can be modeled as medical decision trees (MDTs), is critical to building clinical decision support systems. However, current MDT construction methods rely heavily on time-consuming and laborious manual annotation. To address this challenge, we propose PI-LoRA (Path-Integrated LoRA), a novel low-rank adaptation method for automatically extracting MDTs from clinical guidelines and textbooks. We integrate gradient path information to capture synergistic effects between different modules, enabling more effective and reliable rank allocation. This framework ensures that the most critical modules receive appropriate rank allocations while less important ones are pruned, resulting in a more efficient and accurate model for extracting medical decision trees from clinical texts. Extensive experiments on medical guideline datasets demonstrate that our PI-LoRA method significantly outperforms existing parameter-efficient fine-tuning approaches for the Text2MDT task, achieving better accuracy with substantially reduced model complexity. The proposed method achieves state-of-the-art results while maintaining a lightweight architecture, making it particularly suitable for clinical decision support systems where computational resources may be limited.

[85] FocusMed: A Large Language Model-based Framework for Enhancing Medical Question Summarization with Focus Identification

Chao Liu, Ling Luo, Tengxiao Lv, Huan Zhuang, Lejing Yu, Jian Wang, Hongfei Lin

Main category: cs.CL

TL;DR: Proposes a core focus guidance framework for medical question summarization that improves focus identification and reduces hallucinations in LLMs.

DetailsMotivation: Consumer health questions contain redundant information and non-professional terms, making diagnosis inefficient. Existing methods struggle with poor question focus identification and model hallucination.

Method: Uses prompt templates to extract core focus from CHQs, constructs fine-tuning datasets with CHQ-FAQ pairs, and implements multi-dimensional quality evaluation and selection mechanism.

Result: Achieves state-of-the-art performance on two MQS datasets across all evaluation metrics, with significant improvement in focus identification and hallucination reduction.

Conclusion: The proposed framework effectively enhances LLMs’ ability to generate faithful medical question summaries by improving focus identification and mitigating hallucinations.

Abstract: With the rapid development of online medical platforms, consumer health questions (CHQs) are inefficient in diagnosis due to redundant information and frequent non-professional terms. The medical question summary (MQS) task aims to transform CHQs into streamlined doctors’ frequently asked questions (FAQs), but existing methods still face challenges such as poor identification of question focus and model hallucination. This paper explores the potential of large language models (LLMs) in the MQS task and finds that direct fine-tuning is prone to focus identification bias and generates unfaithful content. To this end, we propose an optimization framework based on core focus guidance. First, a prompt template is designed to drive the LLMs to extract the core focus from the CHQs that is faithful to the original text. Then, a fine-tuning dataset is constructed in combination with the original CHQ-FAQ pairs to improve the ability to identify the focus of the question. Finally, a multi-dimensional quality evaluation and selection mechanism is proposed to comprehensively improve the quality of the summary from multiple dimensions. We conduct comprehensive experiments on two widely-adopted MQS datasets using three established evaluation metrics. The proposed framework achieves state-of-the-art performance across all measures, demonstrating a significant boost in the model’s ability to identify critical focus of questions and a notable mitigation of hallucinations. The source codes are freely available at https://github.com/DUT-LiuChao/FocusMed.

[86] Multi-Agent Tool-Integrated Policy Optimization

Zhanfeng Mo, Xingxuan Li, Yuntao Chen, Lidong Bing

Main category: cs.CL

TL;DR: MATPO enables reinforcement learning training of multi-agent tool-integrated frameworks within a single LLM instance, outperforming single-agent baselines by 18.38% with better robustness to noisy tools.

DetailsMotivation: Existing single-agent approaches for tool-integrated planning suffer from limited context length and noisy tool responses, while multi-agent frameworks lack effective reinforcement learning post-training methods.

Method: Multi-Agent Tool-Integrated Policy Optimization (MATPO) trains distinct planner and worker roles within a single LLM using role-specific prompts via reinforcement learning, with principled credit assignment across agent rollouts.

Result: Experiments on GAIA-text, WebWalkerQA, and FRAMES show MATPO consistently outperforms single-agent baselines by 18.38% average relative improvement and exhibits greater robustness to noisy tool outputs.

Conclusion: Unifying multiple agent roles within a single LLM is effective, providing practical insights for stable and efficient multi-agent RL training while eliminating the memory overhead of deploying multiple LLMs.

Abstract: Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. A natural solution is to adopt a multi-agent framework with planner- and worker-agents to manage context. However, no existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks. To address this gap, we propose Multi-Agent Tool-Integrated Policy Optimization (MATPO), which enables distinct roles (planner and worker) to be trained within a single LLM instance using role-specific prompts via reinforcement learning. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design eliminates the need to deploy multiple LLMs, which would be memory-intensive, while preserving the benefits of specialization. Experiments on GAIA-text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single-agent baselines by an average of 18.38% relative improvement in performance and exhibits greater robustness to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi-agent RL training.

[87] TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Chanjoo Jung, Jaehyung Kim

Main category: cs.CL

TL;DR: TiTok enables LoRA transplantation across different LLMs through token-level knowledge transfer without additional models, achieving +4~8% performance gains over baselines.

DetailsMotivation: Current PEFT methods like LoRA are model-dependent and cannot transfer across different backbones, while knowledge distillation depends on training data and adds complexity with synthetic data generation.

Method: TiTok uses contrastive excess between source models with and without LoRA to capture task-relevant information, enabling selective filtering of synthetic data at token level without additional overhead.

Result: Experiments on three benchmarks across multiple transfer settings show consistent effectiveness with average performance gains of +4~8% compared to baselines.

Conclusion: TiTok provides an effective framework for LoRA transplantation that works across different model backbones without requiring additional models or complex synthetic data generation.

Abstract: Large Language Models (LLMs) are widely applied in real world scenarios, but fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs, but the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data, but this adds complexity because it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, our experiments show that the proposed method is consistently effective, achieving average performance gains of +4~8% compared to baselines overall.

[88] Multilingual Routing in Mixture-of-Experts

Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng

Main category: cs.CL

TL;DR: MoE models route tokens language-specifically in early/late layers but show cross-lingual alignment in middle layers. Performance correlates with English routing similarity, and interventions promoting middle-layer English experts boost multilingual performance by 1-2%.

DetailsMotivation: To understand how MoE architectures handle multilingual data and their sparse routing dynamics, given their importance in scaling modern LLMs.

Method: Analyzed expert routing patterns using parallel multilingual datasets, examined layer-wise phenomena, and developed inference-time interventions that steer routers by promoting middle-layer task experts frequently activated in English.

Result: Found strong correlation between language performance and English routing similarity in middle layers. Interventions increasing cross-lingual routing alignment consistently improved multilingual performance by 1-2% across tasks, models, and languages.

Conclusion: MoEs process non-English text with language-specific routing in early/late layers and cross-lingual alignment in middle layers. Multilingual generalization is limited by the model’s ability to leverage language-universal experts across all languages.

Abstract: Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model’s performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model’s ability to leverage language-universal experts in all languages.

[89] JSON Whisperer: Efficient JSON Editing with LLMs

Sarel Duanis, Asnat Greenstein-Messica, Eliya Habba

Main category: cs.CL

TL;DR: JSON Whisperer enables LLMs to generate RFC 6902 diff patches instead of regenerating entire JSON documents, reducing computational costs while maintaining edit quality.

DetailsMotivation: Current LLM approaches for JSON editing inefficiently regenerate entire structures for each edit, leading to computational inefficiency.

Method: Introduces EASE (Explicitly Addressed Sequence Encoding) that transforms arrays into dictionaries with stable keys to eliminate index arithmetic complexities and handle array manipulations better.

Result: Patch generation with EASE reduces token usage by 31% while maintaining edit quality within 5% of full regeneration, with particular gains for complex instructions and list manipulations.

Conclusion: JSON Whisperer provides an efficient framework for LLM-based JSON editing through patch generation rather than full document regeneration.

Abstract: Large language models (LLMs) can modify JSON documents through natural language commands, but current approaches regenerate entire structures for each edit, resulting in computational inefficiency. We present JSON Whisperer, a framework that enables LLMs to generate RFC 6902 diff patches-expressing only the necessary modifications-rather than complete documents. We identify two key challenges in patch-based editing: (1) LLMs often miss related updates when generating isolated patches, and (2) array manipulations require tracking index shifts across operations, which LLMs handle poorly. To address these issues, we introduce EASE (Explicitly Addressed Sequence Encoding), which transforms arrays into dictionaries with stable keys, eliminating index arithmetic complexities. Our evaluation shows that patch generation with EASE reduces token usage by 31% while maintaining edit quality within 5% of full regeneration with particular gains for complex instructions and list manipulations. The dataset is available at: https://github.com/emnlp2025/JSON-Whisperer/

[90] A Low-Resource Speech-Driven NLP Pipeline for Sinhala Dyslexia Assistance

Peshala Perera, Deshan Sumanathilaka

Main category: cs.CL

TL;DR: An assistive system for Sinhala-speaking adults with dyslexia that uses speech-to-text, error detection, text correction, and text-to-speech to create a multimodal feedback loop, achieving 0.65 overall system accuracy despite limited datasets.

DetailsMotivation: Dyslexia in adults is under-researched, especially in non-English contexts like Sinhala, a low-resource language with limited accessibility tools, despite significant personal and professional impacts.

Method: Integrates Whisper for speech-to-text, SinBERT for identifying dyslexic errors, combined mT5 and Mistral-based model for text correction, and gTTS for text-to-speech conversion to create a complete multimodal feedback loop.

Result: Achieves 0.66 transcription accuracy, 0.7 correction accuracy, and 0.65 overall system accuracy despite challenges from limited Sinhala-language datasets.

Conclusion: Demonstrates feasibility and effectiveness of inclusive NLP technologies for underrepresented languages, highlighting the importance of accessible tools for dyslexic adults in low-resource language contexts.

Abstract: Dyslexia in adults remains an under-researched and under-served area, particularly in non-English-speaking contexts, despite its significant impact on personal and professional lives. This work addresses that gap by focusing on Sinhala, a low-resource language with limited tools for linguistic accessibility. We present an assistive system explicitly designed for Sinhala-speaking adults with dyslexia. The system integrates Whisper for speech-to-text conversion, SinBERT, an open-sourced fine-tuned BERT model trained for Sinhala to identify common dyslexic errors, and a combined mT5 and Mistral-based model to generate corrected text. Finally, the output is converted back to speech using gTTS, creating a complete multimodal feedback loop. Despite the challenges posed by limited Sinhala-language datasets, the system achieves 0.66 transcription accuracy and 0.7 correction accuracy with 0.65 overall system accuracy. These results demonstrate both the feasibility and effectiveness of the approach. Ultimately, this work highlights the importance of inclusive Natural Language Processing (NLP) technologies in underrepresented languages and showcases a practical

[91] ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever

Eduardo Martínez Rivera, Filippo Menolascina

Main category: cs.CL

TL;DR: A two-stage retrieval architecture combining ModernBERT for efficient candidate retrieval and ColBERTv2 for re-ranking significantly improves biomedical RAG system performance, achieving state-of-the-art accuracy on MIRAGE benchmark.

DetailsMotivation: To address the trade-off between general-purpose retrievers struggling with domain-specific language and in-domain models having prohibitive computational costs in biomedical RAG systems.

Method: Developed a two-stage retrieval system: lightweight ModernBERT bidirectional encoder for initial candidate retrieval, followed by ColBERTv2 late-interaction model for fine-grained re-ranking, fine-tuned using 10k question-passage pairs from PubMedQA.

Result: ColBERT re-ranker improved Recall@3 by up to 4.2 percentage points. The biomedical RAG system achieved state-of-the-art average accuracy of 0.4448 on MIRAGE benchmark, outperforming MedCPT (0.4436).

Conclusion: The two-stage retrieval architecture with joint fine-tuning of retriever and re-ranker is crucial for optimal performance in biomedical RAG systems, as separate tuning can degrade performance.

Abstract: Retrieval-Augmented Generation (RAG) is a powerful technique for enriching Large Language Models (LLMs) with external knowledge, allowing for factually grounded responses, a critical requirement in high-stakes domains such as healthcare. However, the efficacy of RAG systems is fundamentally restricted by the performance of their retrieval module, since irrelevant or semantically misaligned documents directly compromise the accuracy of the final generated response. General-purpose dense retrievers can struggle with the nuanced language of specialised domains, while the high accuracy of in-domain models is often achieved at prohibitive computational costs. In this work, we aim to address this trade-off by developing and evaluating a two-stage retrieval architecture that combines a lightweight ModernBERT bidirectional encoder for efficient initial candidate retrieval with a ColBERTv2 late-interaction model for fine-grained re-ranking. We conduct comprehensive evaluations of our retriever module performance and RAG system performance in the biomedical context, fine-tuning the IR module using 10k question-passage pairs from PubMedQA. Our analysis of the retriever module confirmed the positive impact of the ColBERT re-ranker, which improved Recall@3 by up to 4.2 percentage points compared to its retrieve-only counterpart. When integrated into the biomedical RAG, our IR module leads to a state-of-the-art average accuracy of 0.4448 on the five tasks of the MIRAGE question-answering benchmark, outperforming strong baselines such as MedCPT (0.4436). Our ablation studies reveal that this performance is critically dependent on a joint fine-tuning process that aligns the retriever and re-ranker; otherwise, the re-ranker might degrade the performance.

[92] ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko

Main category: cs.CL

TL;DR: ReplaceMe is a training-free depth pruning method that replaces transformer blocks with linear operations using only a small calibration dataset, achieving up to 25% pruning while maintaining ~90% performance without retraining.

DetailsMotivation: To develop an efficient pruning method that doesn't require extensive retraining or fine-tuning like conventional approaches, reducing computational overhead while maintaining model performance.

Method: Uses a small calibration dataset to estimate linear transformations that approximate pruned transformer blocks, then seamlessly merges these linear mappings with remaining blocks without adding parameters.

Result: Outperforms other training-free approaches and remains competitive with state-of-the-art pruning methods that require retraining. Achieves 25% pruning while retaining ~90% original performance on LLMs.

Conclusion: ReplaceMe provides an effective training-free alternative to conventional pruning methods, offering significant model compression with minimal computational overhead and competitive performance retention.

Abstract: We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model’s performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.

[93] Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models

Raha Askari, Sina Zarrieß, Özge Alacam, Judith Sieker

Main category: cs.CL

TL;DR: The paper introduces a benchmark to test if small language models can distinguish between Gricean maxim-adhering and maxim-violating utterances, comparing them to children and large language models.

DetailsMotivation: To understand if language models can identify implicit meanings through Gricean maxim violations, which are essential for human communication and pragmatic inference.

Method: Created a novel benchmark based on Surian et al.’s study, testing BabyLMs pretrained on <10M and <100M tokens across five Gricean maxims, comparing performance to children and a large LLM trained on 3T tokens.

Result: Models trained on <100M tokens outperform those on <10M tokens but still fall short of child-level and LLM competence. Modest data improvements lead to better pragmatic behavior and finer-grained differentiation between pragmatic dimensions.

Conclusion: Small language models show some pragmatic capability that improves with more data, but they still lag behind human children and large language models in understanding Gricean maxim violations.

Abstract: Implicit meanings are integral to human communication, making it essential for language models to be capable of identifying and interpreting them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences. Building on Surian et al. (1996)’s study of children’s sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on less than 10M and less than 100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens. We find that overall, models trained on less than 100M tokens outperform those trained on less than 10M, yet fall short of child-level and LLM competence. Our results suggest that modest data increases improve some aspects of pragmatic behavior, leading to finer-grained differentiation between pragmatic dimensions.

[94] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu

Main category: cs.CL

TL;DR: This paper provides a systematic evaluation of hybrid language model architectures that combine self-attention with structured state space models (Mamba), comparing inter-layer and intra-layer fusion strategies across multiple performance metrics.

DetailsMotivation: While hybrid architectures show promising performance for long-context tasks, there has been a lack of systematic comparisons of hybridization strategies and analysis of the key factors behind their effectiveness.

Method: The authors conduct holistic evaluation of hybrid architectures based on inter-layer (sequential) and intra-layer (parallel) fusion, analyzing language modeling performance, long-context capabilities, scaling behavior, and training/inference efficiency.

Result: The study identifies the most critical elements for each hybridization strategy and proposes optimal design recipes for both hybrid models through investigation of their computational primitives.

Conclusion: The comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.

Abstract: Recent progress in large language models demonstrates that hybrid architectures–combining self-attention mechanisms with structured state space models like Mamba–can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.

[95] How I Built ASR for Endangered Languages with a Spoken Dictionary

Christopher Bartley, Anton Ragni

Main category: cs.CL

TL;DR: Building usable ASR for endangered languages with minimal data - 40 minutes of short-form pronunciation data achieves <50% WER for Manx Gaelic and Cornish.

DetailsMotivation: Nearly half of world's languages are endangered but lack ASR support due to strict data requirements. Existing speech data for languages like Manx Gaelic don't match standard ASR pipeline formats.

Method: Using short-form pronunciation resources as alternative data format instead of standard utterance-level supervised data. Tested approach on Manx Gaelic and replicated on Cornish.

Result: 40 minutes of short-form pronunciation data produces usable ASR for Manx Gaelic with <50% Word Error Rate. Successfully replicated on Cornish.

Conclusion: The barrier to entry for building ASR for endangered languages is far lower than previously thought - both in quantity and form of required data.

Abstract: Nearly half of the world’s languages are endangered. Speech technologies such as Automatic Speech Recognition (ASR) are central to revival efforts, yet most languages remain unsupported because standard pipelines expect utterance-level supervised data. Speech data often exist for endangered languages but rarely match these formats. Manx Gaelic ($\sim$2,200 speakers), for example, has had transcribed speech since 1948, yet remains unsupported by modern systems. In this paper, we explore how little data, and in what form, is needed to build ASR for critically endangered languages. We show that a short-form pronunciation resource is a viable alternative, and that 40 minutes of such data produces usable ASR for Manx ($<$50% WER). We replicate our approach, applying it to Cornish ($\sim$600 speakers), another critically endangered language. Results show that the barrier to entry, in quantity and form, is far lower than previously thought, giving hope to endangered language communities that cannot afford to meet the requirements arbitrarily imposed upon them.

[96] Instability in Downstream Task Performance During LLM Pretraining

Yuto Nishida, Masaru Isonuma, Yusuke Oda

Main category: cs.CL

TL;DR: LLM training checkpoints show fluctuating downstream performance. Checkpoint averaging and ensemble methods improve stability without changing training.

DetailsMotivation: Downstream task performance fluctuates substantially during LLM training, making it difficult to identify the best checkpoint.

Method: Analyze performance stability and investigate checkpoint averaging and ensemble methods to aggregate neighboring checkpoints.

Result: Both empirical and theoretical evidence shows these methods improve downstream performance stability.

Conclusion: Post-hoc checkpoint integration methods effectively reduce performance volatility in LLM training.

Abstract: When training large language models (LLMs), it is common practice to track downstream task performance throughout the training process and select the checkpoint with the highest validation score. However, downstream metrics often exhibit substantial fluctuations, making it difficult to identify the checkpoint that truly represents the best-performing model. In this study, we empirically analyze the stability of downstream task performance in an LLM trained on diverse web-scale corpora. We find that task scores frequently fluctuate throughout training, both at the aggregate and example levels. To address this instability, we investigate two post-hoc checkpoint integration methods: checkpoint averaging and ensemble, motivated by the hypothesis that aggregating neighboring checkpoints can reduce performance volatility. We demonstrate both empirically and theoretically that these methods improve downstream performance stability without requiring any changes to the training procedure.

[97] When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova

Main category: cs.CL

TL;DR: PsiloQA is a large-scale multilingual dataset for fine-grained hallucination detection across 14 languages, created through automated pipeline using GPT-4o, enabling comprehensive evaluation of detection methods.

DetailsMotivation: Existing hallucination benchmarks are limited to sequence-level evaluation in English, lacking fine-grained multilingual supervision needed for comprehensive evaluation of LLM hallucinations.

Method: Automated three-stage pipeline: 1) Generate QA pairs from Wikipedia using GPT-4o, 2) Elicit potentially hallucinated answers from diverse LLMs in no-context setting, 3) Automatically annotate hallucinated spans using GPT-4o by comparing against golden answers and retrieved context.

Result: Encoder-based models achieve strongest performance across languages. PsiloQA demonstrates effective cross-lingual generalization and robust knowledge transfer to other benchmarks, while being significantly more cost-efficient than human-annotated datasets.

Conclusion: PsiloQA advances scalable, fine-grained hallucination detection in multilingual settings, providing a comprehensive resource for developing reliable LLM applications requiring factual accuracy.

Abstract: Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods – including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models – and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.

[98] Detecting Distillation Data from Reasoning Models

Hengxiang Zhang, Hyeong Kyu Choi, Yixuan Li, Hongxin Wei

Main category: cs.CL

TL;DR: The paper proposes Token Probability Deviation (TBD), a method to detect distillation data contamination by analyzing probability patterns of generated tokens, where distilled models produce near-deterministic tokens for seen questions.

DetailsMotivation: Reasoning distillation can cause benchmark contamination by inflating performance metrics when evaluation data is included in distillation datasets, creating a need to detect such contamination.

Method: Token Probability Deviation (TBD) quantifies how far generated tokens’ probabilities deviate from a high reference probability, leveraging the observation that distilled models produce more low-probability tokens for unseen questions.

Result: The method achieves competitive performance with AUC of 0.918 and TPR@1% FPR of 0.470 on the S1 dataset, effectively distinguishing seen from unseen questions.

Conclusion: TBD provides an effective solution for detecting distillation data contamination by analyzing token probability patterns, addressing the challenge of partial data availability in distillation scenarios.

Abstract: Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. Then, we propose a novel and effective method Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the analysis that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. Our key idea behind TBD is to quantify how far the generated tokens’ probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen questions. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.

[99] SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin

Main category: cs.CL

TL;DR: SocialHarmBench is a new dataset that tests LLM vulnerabilities in sociopolitical contexts, revealing high failure rates in politically sensitive domains like propaganda and political manipulation, particularly for open-weight models.

DetailsMotivation: Existing safety benchmarks don't adequately test LLM vulnerabilities in high-stakes sociopolitical domains where failures can have direct political consequences, such as political manipulation, propaganda, and surveillance.

Method: Created SocialHarmBench dataset with 585 prompts spanning 7 sociopolitical categories across 34 countries, then evaluated LLM performance on these politically charged contexts.

Result: Open-weight models showed high vulnerability to harmful compliance, with Mistral-7B reaching 97-98% attack success rates in domains like historical revisionism and propaganda. Models were most fragile with 21st-century/pre-20th-century contexts and prompts from Latin America, USA, and UK regions.

Conclusion: Current LLM safeguards fail to generalize to sociopolitical settings, exposing systematic biases and raising concerns about LLM reliability in preserving human rights and democratic values.

Abstract: Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings demonstrate that current safeguards fail to generalize to high-stakes sociopolitical settings, exposing systematic biases and raising concerns about the reliability of LLMs in preserving human rights and democratic values. We share the SocialHarmBench benchmark at https://huggingface.co/datasets/psyonp/SocialHarmBench.

[100] Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment

Davood Rafiei, Morgan Lindsay Heisler, Weiwei Zhang, Mohammadreza Pourreza, Yong Zhang

Main category: cs.CL

TL;DR: Structural alignment between training data and target queries is a strong predictor of SFT success in NL2SQL tasks, with high alignment leading to substantial performance gains and low alignment yielding marginal improvements.

DetailsMotivation: Variability in training data can hinder LLMs' generalization across domains in NL2SQL tasks, and understanding how training data alignment impacts model performance is crucial.

Method: Estimate alignment by comparing distributions of structural SQL features across training set, target data, and model predictions before SFT, tested on three cross-domain NL2SQL benchmarks with multiple model families.

Result: Structural alignment strongly predicts fine-tuning success - high alignment yields substantial accuracy and SQL quality gains, while low alignment provides marginal or no improvements.

Conclusion: Alignment-aware data selection is essential for effective fine-tuning and generalization in NL2SQL tasks.

Abstract: Supervised Fine-Tuning (SFT) is an effective method for adapting Large Language Models (LLMs) on downstream tasks. However, variability in training data can hinder a model’s ability to generalize across domains. This paper studies the problem of dataset alignment for Natural Language to SQL (NL2SQL or text to SQL), examining how well SFT training data matches the structural characteristics of target queries and how this alignment impacts model performance. We hypothesize that alignment can be accurately estimated by comparing the distributions of structural SQL features across the training set, target data, and the model’s predictions prior to SFT. Through comprehensive experiments on three large cross-domain NL2SQL benchmarks and multiple model families, we show that structural alignment is a strong predictor of fine-tuning success. When alignment is high, SFT yields substantial gains in accuracy and SQL generation quality; when alignment is low, improvements are marginal or absent. These findings highlight the importance of alignment-aware data selection for effective fine-tuning and generalization in NL2SQL tasks.

[101] The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models

Amir Hameed Mir

Main category: cs.CL

TL;DR: LSD is a hallucination detection framework that analyzes semantic evolution across transformer layers, achieving high accuracy with single forward pass efficiency.

DetailsMotivation: LLMs often produce fluent but factually incorrect statements (hallucinations), posing serious risks in high-stakes domains where factual accuracy is critical.

Method: Uses margin-based contrastive learning to align hidden activations with ground-truth embeddings from a factual encoder, analyzing semantic trajectories across transformer layers.

Result: Achieves F1-score of 0.92, AUROC of 0.96, and clustering accuracy of 0.89 on TruthfulQA and synthetic datasets, outperforming baseline methods with 5-20x speedup.

Conclusion: LSD provides scalable, model-agnostic real-time hallucination monitoring and offers insights into the geometry of factual consistency in LLMs.

Abstract: Large Language Models (LLMs) often produce fluent yet factually incorrect statements-a phenomenon known as hallucination-posing serious risks in high-stakes domains. We present Layer-wise Semantic Dynamics (LSD), a geometric framework for hallucination detection that analyzes the evolution of hidden-state semantics across transformer layers. Unlike prior methods that rely on multiple sampling passes or external verification sources, LSD operates intrinsically within the model’s representational space. Using margin-based contrastive learning, LSD aligns hidden activations with ground-truth embeddings derived from a factual encoder, revealing a distinct separation in semantic trajectories: factual responses preserve stable alignment, while hallucinations exhibit pronounced semantic drift across depth. Evaluated on the TruthfulQA and synthetic factual-hallucination datasets, LSD achieves an F1-score of 0.92, AUROC of 0.96, and clustering accuracy of 0.89, outperforming SelfCheckGPT and Semantic Entropy baselines while requiring only a single forward pass. This efficiency yields a 5-20x speedup over sampling-based methods without sacrificing precision or interpretability. LSD offers a scalable, model-agnostic mechanism for real-time hallucination monitoring and provides new insights into the geometry of factual consistency within large language models.

[102] A First Context-Free Grammar Applied to Nawatl Corpora Augmentation

Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Martha-Lorena Avendaño-Garrido, Graham Ranger

Main category: cs.CL

TL;DR: A context-free grammar is developed for Nawatl language to generate artificial sentences and expand corpora for language model training, showing preliminary improvements over some LLMs.

DetailsMotivation: Nawatl is a low-resource language with few digital resources and virtually non-existent corpora for machine learning, making it difficult to train language models effectively.

Method: Developed a context-free grammar (CFG) for Nawatl to generate grammatically correct artificial sentences, expanding the existing π-yalli corpus for training algorithms like FastText.

Result: Preliminary results show comparative improvements over some LLMs when using the grammar-generated corpus, but more effective grammars are needed for significant improvement.

Conclusion: Context-free grammars can help expand corpora for low-resource languages like Nawatl, but more sophisticated grammars are required to achieve substantial improvements in language model performance.

Abstract: In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the $\pi$-language type, i.e. a language with few digital resources, in which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences, in order to increase the corpora available for language model training. We want to show that a grammar enables us significantly to expand a corpus in Nawatl which we call $\pi$-\textsc{yalli}. The corpus, thus enriched, enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that by using the grammar, comparative improvements are achieved over some LLMs. However, it is observed that to achieve more significant improvement, grammars that model the Nawatl language even more effectively are required.

[103] Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)

Om Dobariya, Akhil Kumar

Main category: cs.CL

TL;DR: Study shows impolite prompts outperform polite ones in LLM accuracy, with very rude prompts achieving highest performance (84.8%) vs very polite (80.8%) on multiple-choice questions.

DetailsMotivation: To investigate how varying levels of prompt politeness affect model accuracy, as the role of politeness and tone in LLM performance remains underexplored despite known wording effects.

Method: Created 50 base questions across math, science, and history, rewritten into five politeness variants (Very Polite to Very Rude), tested on ChatGPT 4o with 250 prompts, using paired sample t-tests for significance.

Result: Impolite prompts consistently outperformed polite ones, with accuracy increasing from 80.8% (Very Polite) to 84.8% (Very Rude), contrary to expectations and earlier studies.

Conclusion: Newer LLMs respond differently to tonal variation than previously thought, highlighting the importance of studying pragmatic aspects of prompting and raising questions about social dimensions of human-AI interaction.

Abstract: The wording of natural language prompts has been shown to influence the performance of large language models (LLMs), yet the role of politeness and tone remains underexplored. In this study, we investigate how varying levels of prompt politeness affect model accuracy on multiple-choice questions. We created a dataset of 50 base questions spanning mathematics, science, and history, each rewritten into five tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude, yielding 250 unique prompts. Using ChatGPT 4o, we evaluated responses across these conditions and applied paired sample t-tests to assess statistical significance. Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. Our results highlight the importance of studying pragmatic aspects of prompting and raise broader questions about the social dimensions of human-AI interaction.

[104] AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives

Khalid Mehtab Khan, Anagha Kulkarni

Main category: cs.CL

TL;DR: AWARE framework improves cultural capital theme detection in student reflections by enhancing transformer models’ domain, context, and class overlap awareness, outperforming baselines by 2.1 percentage points in Macro-F1.

DetailsMotivation: Cultural capital themes in student reflections are valuable for equitable learning but hard to detect with standard NLP models due to narrative context and domain-specific language.

Method: AWARE framework with three components: Domain Awareness (vocabulary adaptation), Context Awareness (essay-aware embeddings), and Class Overlap Awareness (multi-label strategy for coexisting themes).

Result: AWARE outperforms strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all cultural capital themes.

Conclusion: Provides robust and generalizable methodology for text classification tasks where meaning depends on narrative context.

Abstract: Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model’s awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model’s vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.

[105] Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning

Imran Mansha

Main category: cs.CL

TL;DR: Resource-efficient fine-tuning of LLaMA-3.2-3B using LoRA and QLoRA techniques to enhance medical chain-of-thought reasoning while reducing memory usage by 60% compared to full fine-tuning.

DetailsMotivation: LLMs like GPT-4 and LLaMA have strong reasoning capabilities but require substantial computational resources for fine-tuning, making deployment challenging in low-resource environments.

Method: Parameter-efficient tuning techniques (LoRA and QLoRA) applied to LLaMA-3.2-3B model on medical reasoning datasets, focusing on constrained GPU and memory settings.

Result: Achieved improved reasoning coherence and factual accuracy in medical question-answering while reducing memory usage by up to 60% compared to standard full fine-tuning.

Conclusion: Lightweight adaptations can maintain strong reasoning capabilities in medical AI systems, providing practical deployment strategies for low-resource research environments and balancing efficiency with domain specialization.

Abstract: Large Language Models (LLMs) such as GPT-4 and LLaMA have demonstrated remarkable reasoning abilities but require significant computational resources for fine-tuning. This paper presents a resource-efficient fine-tuning approach for LLaMA-3.2-3B to enhance medical chain-of-thought reasoning while operating under constrained GPU and memory settings. Using parameter-efficient tuning techniques such as LoRA and QLoRA, we adapt the base model on publicly available medical reasoning datasets. The model achieves improved reasoning coherence and factual accuracy while reducing memory usage by up to 60% compared to standard full fine-tuning. Experimental evaluation demonstrates that lightweight adaptations can retain strong reasoning capability in medical question-answering tasks. This work highlights practical strategies for deploying LLMs in low-resource research environments and provides insights into balancing efficiency and domain specialization for medical AI systems.

[106] Imperceptible Jailbreaking against Large Language Models

Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang

Main category: cs.CL

TL;DR: The paper introduces imperceptible jailbreaks using Unicode variation selectors to create adversarial suffixes that appear visually identical to original prompts but alter tokenization to induce harmful responses from LLMs.

DetailsMotivation: Current jailbreaking attacks on vision use imperceptible perturbations, while text attacks require visible modifications. The authors aim to develop imperceptible text jailbreaks that exploit Unicode characters to bypass detection.

Method: Proposed a chain-of-search pipeline to generate adversarial suffixes using invisible Unicode variation selectors that alter tokenization without visible changes to the prompt.

Result: Achieved high attack success rates against four aligned LLMs and demonstrated generalization to prompt injection attacks, all without producing visible modifications.

Conclusion: Imperceptible jailbreaks using Unicode variation selectors are effective at bypassing LLM safety mechanisms while remaining visually undetectable, highlighting a new vulnerability in text-based AI systems.

Abstract: Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is “secretly” altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.

[107] A Set of Quebec-French Corpus of Regional Expressions and Terms

David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury

Main category: cs.CL

TL;DR: The paper introduces two new benchmark datasets (QFrCoRE and QFrCoRT) for testing dialect understanding in Quebec French using regional idioms, and demonstrates their effectiveness in evaluating LLM dialect proficiency.

DetailsMotivation: To combine idiom understanding with dialect understanding by using regional idioms as a test for dialect proficiency, specifically targeting the Quebec dialect of French.

Method: Created two benchmark datasets: QFrCoRE (4,633 idiomatic phrases) and QFrCoRT (171 regional idiomatic words), with a replicable methodology for constructing similar corpora for other dialects.

Result: Experiments with 94 LLMs showed that the regional idiom benchmarks reliably measure a model’s proficiency in the specific Quebec French dialect.

Conclusion: Regional idioms serve as effective benchmarks for evaluating dialect understanding in language models, and the proposed methodology can be replicated for other dialects.

Abstract: The tasks of idiom understanding and dialect understanding are both well-established benchmarks in natural language processing. In this paper, we propose combining them, and using regional idioms as a test of dialect understanding. Towards this end, we propose two new benchmark datasets for the Quebec dialect of French: QFrCoRE, which contains 4,633 instances of idiomatic phrases, and QFrCoRT, which comprises 171 regional instances of idiomatic words. We explain how to construct these corpora, so that our methodology can be replicated for other dialects. Our experiments with 94 LLM demonstrate that our regional idiom benchmarks are a reliable tool for measuring a model’s proficiency in a specific dialect.

[108] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Omri Uzan, Asaf Yehudai, Roi pony, Eyal Shnarch, Ariel Gera

Main category: cs.CL

TL;DR: GQR is a test-time optimization method that enhances vision-centric document retrieval by refining query embeddings using guidance from a complementary text retriever, achieving better performance with significantly improved efficiency.

DetailsMotivation: Vision-centric multimodal retrieval models face deployment challenges due to large representation sizes and modality gaps, while existing hybrid methods fail to exploit rich interactions between different retrieval models.

Method: Guided Query Refinement (GQR) refines a primary retriever’s query embedding using guidance scores from a complementary retriever through test-time optimization.

Result: GQR enables vision-centric models to match performance of models with much larger representations while being 14x faster and requiring 54x less memory.

Conclusion: GQR effectively advances the Pareto frontier for performance and efficiency in multimodal retrieval, demonstrating the value of optimized hybrid approaches.

Abstract: Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model’s representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever’s query embedding using guidance from a complementary retriever’s scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval

[109] COLE: a Comprehensive Benchmark for French Language Understanding Evaluation

David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury

Main category: cs.CL

TL;DR: COLE is a new French NLU benchmark with 23 diverse tasks, benchmarking 94 LLMs to analyze the state of French language understanding.

DetailsMotivation: To address the need for comprehensive evaluation of French Natural Language Understanding capabilities, particularly focusing on linguistic phenomena specific to French.

Method: Created COLE benchmark with 23 diverse NLU tasks covering sentiment analysis, paraphrase detection, grammatical judgment, and reasoning. Evaluated 94 large language models on this benchmark.

Result: Revealed significant performance gap between closed- and open-weights models. Identified challenging frontiers: zero-shot extractive QA, fine-grained word sense disambiguation, and understanding regional language variations.

Conclusion: COLE is released as a public resource to foster progress in French language modeling by providing comprehensive evaluation capabilities.

Abstract: To address the need for a more comprehensive evaluation of French Natural Language Understanding (NLU), we introduce COLE, a new benchmark composed of 23 diverse task covering a broad range of NLU capabilities, including sentiment analysis, paraphrase detection, grammatical judgment, and reasoning, with a particular focus on linguistic phenomena relevant to the French language. We benchmark 94 large language models (LLM), providing an extensive analysis of the current state of French NLU. Our results highlight a significant performance gap between closed- and open-weights models and identify key challenging frontiers for current LLMs, such as zero-shot extractive question-answering (QA), fine-grained word sense disambiguation, and understanding of regional language variations. We release COLE as a public resource to foster further progress in French language modelling.

[110] SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs

Dachuan Shi, Abedelkadir Asi, Keying Li, Xiangchi Yuan, Leyan Pan, Wenke Lee, Wen Xiao

Main category: cs.CL

TL;DR: SwiReasoning is a training-free framework that dynamically switches between explicit and latent reasoning in LLMs using entropy-based confidence estimation, improving both accuracy and token efficiency.

DetailsMotivation: Latent reasoning in LLMs faces challenges: 1) broad search distribution diffuses probability mass and hurts accuracy, and 2) overthinking wastes tokens and reduces efficiency, especially in training-free settings.

Method: SwiReasoning dynamically switches between explicit and latent reasoning guided by block-wise confidence from entropy trends, and limits maximum thinking-block switches to prevent overthinking.

Result: On mathematics and STEM benchmarks, SwiReasoning improves average accuracy by 1.5%-2.8% across different LLMs and improves token efficiency by 56%-79% under constrained budgets.

Conclusion: SwiReasoning effectively balances exploration and exploitation in LLM reasoning, achieving better accuracy and significant token efficiency gains, especially with tighter computational budgets.

Abstract: Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 56%-79%, with larger gains as budgets tighten.

[111] Slm-mux: Orchestrating small language models for reasoning

Chenyu Wang, Zishen Wan, Hao Kang, Emma Chen, Zhiqiang Xie, Tushar Krishna, Vijay Janapa Reddi, Yilun Du

Main category: cs.CL

TL;DR: A three-stage approach for orchestrating multiple small language models (SLMs) that achieves higher accuracy than individual models through SLM-MUX architecture, model selection search, and test-time scaling.

DetailsMotivation: Small language models are efficient and excel at specific tasks, but existing orchestration methods perform poorly when applied to SLMs compared to frontier models like GPT-4.

Method: Proposed SLM-MUX multi-model architecture with two optimization strategies: model selection search to identify complementary SLMs from a pool, and test-time scaling tailored to SLM-MUX.

Result: Achieves up to 13.4% improvement on MATH, 8.8% on GPQA, and 7.0% on GSM8K compared to existing methods. With just two SLMs, outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH.

Conclusion: SLMs can be effectively orchestrated into more accurate and efficient systems through the proposed approach, with theoretical analyses supporting the advantages.

Abstract: With the rapid development of language models, the number of small language models (SLMs) has grown significantly. Although they do not achieve state-of-the-art accuracy, they are more efficient and often excel at specific tasks. This raises a natural question: can multiple SLMs be orchestrated into a system where each contributes effectively, achieving higher accuracy than any individual model? Existing orchestration methods have primarily targeted frontier models (e.g., GPT-4) and perform suboptimally when applied to SLMs. To address this gap, we propose a three-stage approach for orchestrating SLMs. First, we introduce SLM-MUX, a multi-model architecture that effectively coordinates multiple SLMs. Building on this, we develop two optimization strategies: (i) a model selection search that identifies the most complementary SLMs from a given pool, and (ii) test-time scaling tailored to SLM-MUX. Our approach delivers strong results: Compared to existing orchestration methods, our approach achieves up to 13.4% improvement on MATH, 8.8% on GPQA, and 7.0% on GSM8K. With just two SLMS, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH. We further provide theoretical analyses to substantiate the advantages of our method. In summary, we demonstrate that SLMs can be effectively orchestrated into more accurate and efficient systems through the proposed approach.

[112] TeachLM: Post-Training LLMs for Education Using Authentic Learning Data

Janos Perczel, Jin Chow, Dorottya Demszky

Main category: cs.CL

TL;DR: TeachLM is an LLM fine-tuned for teaching using authentic student-tutor interaction data, enabling generation of synthetic dialogues and improving pedagogical performance.

DetailsMotivation: Current LLMs lack access to high-quality training data reflecting actual student learning, and prompt engineering has limitations for encoding complex pedagogical strategies.

Method: Parameter-efficient fine-tuning of state-of-the-art models using 100,000 hours of anonymized one-on-one student-tutor interactions, creating an authentic student model for synthetic dialogue generation.

Result: Fine-tuning on authentic learning data doubled student talk time, improved questioning style, increased dialogue turns by 50%, and enabled greater personalization of instruction.

Conclusion: TeachLM demonstrates that fine-tuning on authentic educational data significantly enhances conversational and pedagogical capabilities of LLMs for teaching applications.

Abstract: The promise of generative AI to revolutionize education is constrained by the pedagogical limits of large language models (LLMs). A major issue is the lack of access to high-quality training data that reflect the learning of actual students. Prompt engineering has emerged as a stopgap, but the ability of prompts to encode complex pedagogical strategies in rule-based natural language is inherently limited. To address this gap we introduce TeachLM - an LLM optimized for teaching through parameter-efficient fine-tuning of state-of-the-art models. TeachLM is trained on a dataset comprised of 100,000 hours of one-on-one, longitudinal student-tutor interactions maintained by Polygence, which underwent a rigorous anonymization process to protect privacy. We use parameter-efficient fine-tuning to develop an authentic student model that enables the generation of high-fidelity synthetic student-tutor dialogues. Building on this capability, we propose a novel multi-turn evaluation protocol that leverages synthetic dialogue generation to provide fast, scalable, and reproducible assessments of the dialogical capabilities of LLMs. Our evaluations demonstrate that fine-tuning on authentic learning data significantly improves conversational and pedagogical performance - doubling student talk time, improving questioning style, increasing dialogue turns by 50%, and greater personalization of instruction.

[113] Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models

Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang

Main category: cs.CL

TL;DR: Tolerator is a training-free decoding strategy for diffusion LLMs that enables token revision through cross-validation, addressing the irreversible token acceptance problem in vanilla diffusion decoding.

DetailsMotivation: Vanilla decoding in discrete diffusion LLMs suffers from irreversible token acceptance where early mistakes persist, harming output quality. Existing methods lack the ability to revise previously accepted tokens.

Method: Two-stage process: (1) sequence fill-up and (2) iterative refinement by remasking and decoding a subset of tokens while using remaining tokens as context, enabling cross-validation and token correction.

Result: Consistent improvements over baselines on five benchmarks covering language understanding, code generation, and mathematics under the same computational budget.

Conclusion: Decoding algorithms are crucial to realizing the full potential of diffusion large language models, and Tolerator provides an effective training-free solution for improved diffusion decoding.

Abstract: Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. As a result, early mistakes persist across iterations, harming both intermediate predictions and final output quality. To address this issue, we propose Tolerator (Token-Level Cross-Validation Refinement), a training-free decoding strategy that leverages cross-validation among predicted tokens. Unlike existing methods that follow a single progressive unmasking procedure, Tolerator introduces a two-stage process: (i) sequence fill-up and (ii) iterative refinement by remasking and decoding a subset of tokens while treating the remaining as context. This design enables previously accepted tokens to be reconsidered and corrected when necessary, leading to more reliable diffusion decoding outputs. We evaluate Tolerator on five standard benchmarks covering language understanding, code generation, and mathematics. Experiments show that our method achieves consistent improvements over the baselines under the same computational budget. These findings suggest that decoding algorithms are crucial to realizing the full potential of diffusion large language models. Code and data are publicly available.

[114] Semantic Journeys: Quantifying Change in Emoji Meaning from 2012-2018

Alexander Robertson, Farhana Ferdousi Liza, Dong Nguyen, Barbara McGillivray, Scott A. Hale

Main category: cs.CL

TL;DR: First longitudinal study of emoji semantic change over 6 years of Twitter data, identifying 5 patterns of development and finding less abstract emoji are more likely to change.

DetailsMotivation: Previous research has only considered emoji semantics from a static perspective, lacking understanding of how emoji meanings evolve over time.

Method: Applied computational linguistics techniques to analyze six years of Twitter data to track semantic changes in emoji usage.

Result: Identified five patterns in emoji semantic development and found that less abstract emoji are more prone to semantic change. Also analyzed effects of seasonality and world events on emoji meanings.

Conclusion: The study provides the first evidence of emoji semantic evolution over time and makes data publicly available with a web interface for further exploration of semantic change in emoji.

Abstract: The semantics of emoji has, to date, been considered from a static perspective. We offer the first longitudinal study of how emoji semantics changes over time, applying techniques from computational linguistics to six years of Twitter data. We identify five patterns in emoji semantic development and find evidence that the less abstract an emoji is, the more likely it is to undergo semantic change. In addition, we analyse select emoji in more detail, examining the effect of seasonality and world events on emoji semantics. To aid future work on emoji and semantics, we make our data publicly available along with a web-based interface that anyone can use to explore semantic change in emoji.

[115] Understanding Retrieval Augmentation for Long-Form Question Answering

Hung-Ting Chen, Fangyuan Xu, Shane Arora, Eunsol Choi

Main category: cs.CL

TL;DR: Study on how retrieved documents impact long-form question answering, analyzing answer attribution and LM behavior with varying evidence documents and models.

DetailsMotivation: To understand how retrieved documents are utilized in language models for long-form generation tasks, particularly focusing on answer attribution to evidence documents.

Method: Conducted two controlled studies: one fixing the LM while varying evidence documents, and another fixing evidence documents while varying LMs. Collected SALAD dataset with human annotations for sentence-level answer attribution.

Result: Found that LMs can leverage relevant documents but generated answers are only partially attributable to the documents, especially for LMs not trained with retrieval augmentation.

Conclusion: Retrieval augmentation impacts long knowledge-rich text generation, revealing partial attribution issues and providing directions for future work.

Abstract: How retrieved documents are used in language models (LMs) for long-form generation task is understudied. We present two controlled studies on retrieval-augmented LM for long-form question answering (LFQA): one fixing the LM and varying evidence documents and the other fixing evidence documents and varying the LMs. We study various attributes of generated answers (e.g., fluency, length, variance), with an emphasis on the attribution of generated answers to in-context evidence documents. We collect a dataset (SALAD) containing human annotations of sentence-level answer attribution in LFQA and evaluate existing methods for automatically judging attribution. We find that while LMs can leverage relevant in-context documents, the generated answer is only partially attributable towards the documents, especially for LMs trained without retrieval augmentation. Together, our analysis reveals how retrieval augmentation impacts long knowledge-rich text generation and provide directions for future work.

[116] Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals

Yongxin Zhou, Fabien Ringeval, François Portet

Main category: cs.CL

TL;DR: GPT models (ChatGPT, GPT-4, GPT-4o) can generate dialogue summaries that follow human guidelines better than task-specific models and reference summaries, though they produce longer outputs and show different lexical/structure patterns.

DetailsMotivation: To investigate GPT models' ability to generate dialogue summaries that adhere to human guidelines, comparing their performance against task-specific models and reference summaries.

Method: Evaluated GPT models using various prompts on DialogSum (English social conversations) and DECODA (French call center) datasets. Used human evaluation based on summarization guidelines, complemented by quantitative and qualitative analyses.

Result: GPT-generated summaries were preferred over task-specific pre-trained models and reference summaries. Models showed ability to follow human guidelines but produced longer outputs with divergent lexical and structural alignment compared to references.

Conclusion: GPT models can effectively follow human guidelines for dialogue summarization, but there’s a discrepancy between automatic metrics (ROUGE, BERTScore) and human evaluation, highlighting the need for more reliable automatic evaluation metrics.

Abstract: This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models’ ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics.

[117] When “Competency” in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers

Divij Handa, Zehua Zhang, Amir Saeidi, Shrinidhi Kumbhar, Md Nayem Uddin, Aswin RRV, Chitta Baral

Main category: cs.CL

TL;DR: Advanced LLMs become more vulnerable to jailbreaking attacks as their reasoning improves, enabling them to decode complex custom ciphers that bypass safety training.

DetailsMotivation: To study the paradoxical vulnerability where improved reasoning in LLMs makes them more susceptible to novel jailbreaking attacks using custom ciphers that aren't covered in safety training.

Method: Introduces ACE (Attacks using Custom Encryptions) and LACE (Layered Attacks using Custom Encryptions) techniques that encode malicious queries with novel, multi-layer ciphers, and develops CipherBench benchmark to evaluate LLM cipher decoding accuracy.

Result: Experiments show more capable LLMs at decoding ciphers are more vulnerable to LACE, with success rates on gpt-oss-20b increasing from 60% with ACE to 72% with LACE.

Conclusion: There’s a critical trade-off where LLMs’ improved ability to decipher complex user ciphers makes them increasingly exploitable, as many such ciphers cannot be preemptively included in safety training.

Abstract: Recent advancements in Large Language Model (LLM) safety have primarily focused on mitigating attacks crafted in natural language or common ciphers (e.g. Base64), which are likely integrated into newer models’ safety training. However, we reveal a paradoxical vulnerability: as LLMs advance in reasoning, they inadvertently become more susceptible to novel jailbreaking attacks. Enhanced reasoning enables LLMs to interpret complex instructions and decode complex user-defined ciphers, creating an exploitable security gap. To study this vulnerability, we introduce Attacks using Custom Encryptions (ACE), a jailbreaking technique that encodes malicious queries with novel ciphers. Extending ACE, we introduce Layered Attacks using Custom Encryptions (LACE), which applies multi-layer ciphers to amplify attack complexity. Furthermore, we develop CipherBench, a benchmark designed to evaluate LLMs’ accuracy in decoding encrypted benign text. Our experiments reveal a critical trade-off: LLMs that are more capable of decoding ciphers are more vulnerable to LACE, with success rates on gpt-oss-20b escalating from 60% under ACE to 72% with LACE. These findings highlight a critical insight: as LLMs become more adept at deciphering complex user ciphers–many of which cannot be preemptively included in safety training–they become increasingly exploitable.

[118] Rowen: Adaptive Retrieval-Augmented Generation for Hallucination Mitigation in LLMs

Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: Rowen is a framework that enhances LLMs with adaptive retrieval augmentation to address hallucinations by detecting inconsistencies in responses across languages/models and retrieving external information when uncertainty is high.

DetailsMotivation: LLMs face challenges with hallucinations due to limited parametric knowledge causing internal hallucinations, while external information can introduce irrelevant content leading to external hallucinations.

Method: Introduces a consistency-based hallucination detection module that assesses model uncertainty by evaluating semantic inconsistencies in responses across different languages or models, and activates external information retrieval when high uncertainty is detected.

Result: Rowen surpasses current state-of-the-art methods in both detecting and mitigating hallucinated content in LLM outputs.

Conclusion: The Rowen framework effectively balances parametric knowledge and external information through adaptive retrieval augmentation to address hallucination challenges in LLMs.

Abstract: Hallucinations present a significant challenge for large language models (LLMs). The utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of LLMs, potentially resulting in internal hallucinations. While incorporating external information can help fill knowledge gaps, it also introduces the risk of irrelevant information, thereby increasing the likelihood of external hallucinations. To balance the use of parametric knowledge within LLMs and external information, in this study, we present Rowen, a novel framework that enhances LLMs with an adaptive retrieval augmentation process tailored to address hallucinated outputs. Rowen introduces a consistency-based hallucination detection module, which assesses the model’s uncertainty regarding the input query by evaluating the semantic inconsistencies in various responses generated across different languages or models. When high uncertainties in the responses are detected, Rowen activates the retrieval of external information to rectify the model outputs. Through comprehensive empirical experiments, we demonstrate that Rowen surpasses the current state-of-the-art in both detecting and mitigating hallucinated content within the outputs of LLMs.

[119] RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

Benjamin Townsend, Madison May, Katherine Mackowiak, Christopher Wells

Main category: cs.CL

TL;DR: RealKIE is a benchmark of five challenging datasets for key information extraction, focusing on enterprise applications with diverse document types and complex layouts.

DetailsMotivation: To advance key information extraction methods by providing realistic testing grounds for enterprise applications, addressing challenges like poor text serialization, sparse annotations in long documents, and complex tabular layouts.

Method: Created five diverse datasets (SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, Resource Contracts) with detailed annotation process, document processing techniques, and baseline modeling approaches.

Result: Developed a comprehensive benchmark that facilitates NLP model development for practical challenges in information extraction, supporting industry-specific problem solving.

Conclusion: RealKIE provides valuable resources for advancing key information extraction technologies in enterprise contexts, with all annotated data, OCR outputs, and baseline code made publicly available.

Abstract: We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and contract analysis. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data, OCR outputs, and code to reproduce baselines are available to download at https://indicodatasolutions.github.io/RealKIE/.

[120] ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks

Yinghao Zhu, Junyi Gao, Zixiang Wang, Weibin Liao, Xiaochen Zheng, Lifang Liang, Miguel O. Bernabeu, Yasha Wang, Lequan Yu, Chengwei Pan, Ewen M. Harrison, Liantao Ma

Main category: cs.CL

TL;DR: LLMs now outperform specialized models in clinical prediction tasks, especially with unstructured clinical notes, challenging previous assumptions about their utility in non-generative healthcare applications.

DetailsMotivation: There is ongoing debate about LLMs' utility in non-generative clinical prediction, with concerns about potential misuse and lack of systematic benchmarking compared to specialized models.

Method: Benchmarked 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured EHR data, assessing reasoning, reliability, and fairness.

Result: Leading LLMs in zero-shot settings outperform finetuned BERT models on clinical notes. On structured EHRs, advanced LLMs show potent zero-shot capabilities, often surpassing conventional models in data-scarce settings. Open-source LLMs can match proprietary counterparts.

Conclusion: Modern LLMs are competitive tools for non-generative clinical prediction, necessitating re-evaluation of model selection strategies and challenging current assumptions in medical AI.

Abstract: Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility in non-generative clinical prediction, often presumed inferior to specialized models, remains under-evaluated, leading to ongoing debate within the field and potential for misuse, misunderstanding, or over-reliance due to a lack of systematic benchmarking. Our ClinicRealm study addresses this by benchmarking 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHR), while also assessing their reasoning, reliability, and fairness. Key findings reveal a significant shift: for clinical note predictions, leading LLMs (e.g., DeepSeek-V3.1-Think, GPT-5) in zero-shot settings now decisively outperform finetuned BERT models. On structured EHRs, while specialized models excel with ample data, advanced LLMs (e.g., GPT-5, DeepSeek-V3.1-Think) show potent zero-shot capabilities, often surpassing conventional models in data-scarce settings. Notably, leading open-source LLMs can match or exceed proprietary counterparts. These results provide compelling evidence that modern LLMs are competitive tools for non-generative clinical prediction, particularly with unstructured text and offering data-efficient structured data options, thus necessitating a re-evaluation of model selection strategies. This research should serve as an important insight for medical informaticists, AI developers, and clinical researchers, potentially prompting a reassessment of current assumptions and inspiring new approaches to LLM application in predictive healthcare.

[121] Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning

Peichao Lai, Zhengfeng Zhang, Wentao Zhang, Fangcheng Fu, Bin Cui

Main category: cs.CL

TL;DR: Proposes GCSE model with LLM-based data augmentation using knowledge graphs to improve unsupervised sentence embeddings by addressing data diversity and noise issues.

DetailsMotivation: Existing LLM-based data augmentation methods suffer from limited data diversity and high data noise, neglecting fine-grained knowledge like entities and quantities.

Method: Pipeline using knowledge graphs to extract entities/quantities for diverse LLM data generation, plus GCSE model with Gaussian-decayed function to limit false hard negative impact.

Result: Achieves state-of-the-art performance in STS tasks with fewer data samples and smaller LLMs, demonstrating efficiency and robustness.

Conclusion: The proposed approach effectively addresses data diversity and noise challenges in unsupervised sentence embedding, achieving superior performance with improved efficiency.

Abstract: Recently, using large language models (LLMs) for data augmentation has led to considerable improvements in unsupervised sentence embedding models. However, existing methods encounter two primary challenges: limited data diversity and high data noise. Current approaches often neglect fine-grained knowledge, such as entities and quantities, leading to insufficient diversity. Besides, unsupervised data frequently lacks discriminative information, and the generated synthetic samples may introduce noise. In this paper, we propose a pipeline-based data augmentation method via LLMs and introduce the Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model to enhance unsupervised sentence embeddings. To tackle the issue of low data diversity, our pipeline utilizes knowledge graphs (KGs) to extract entities and quantities, enabling LLMs to generate more diverse samples. To address high data noise, the GCSE model uses a Gaussian-decayed function to limit the impact of false hard negative samples, enhancing the model’s discriminative capability. Experimental results show that our approach achieves state-of-the-art performance in semantic textual similarity (STS) tasks, using fewer data samples and smaller LLMs, demonstrating its efficiency and robustness across various models.

[122] Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning

Jared Joselowitz, Ritam Majumdar, Arjun Jagota, Matthieu Bou, Nyal Patel, Satyapriya Krishna, Sonali Parbhoo

Main category: cs.CL

TL;DR: This paper applies inverse reinforcement learning to extract implicit reward functions from toxicity-aligned LLMs, achieving 85% accuracy in predicting human preferences and enabling improved model fine-tuning.

DetailsMotivation: Large language models trained with RLHF have impressive capabilities but their reward functions and decision-making processes remain opaque, creating a need for better interpretability methods.

Method: The authors use inverse reinforcement learning (IRL) to recover implicit reward functions from toxicity-aligned LLMs of varying sizes, then use these extracted reward models to fine-tune new LLMs.

Result: The IRL-derived reward models achieve up to 85% accuracy in predicting human preferences. Analysis reveals insights about reward function non-identifiability, model size vs interpretability relationship, and RLHF pitfalls. Fine-tuned models show comparable or improved performance on toxicity benchmarks.

Conclusion: IRL provides a valuable new approach for understanding and improving LLM alignment, with important implications for responsible development and deployment of these systems.

Abstract: Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 85% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.

[123] Don’t Pay Attention, PLANT It: Pretraining Attention via Learning-to-Rank

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

Main category: cs.CL

TL;DR: PLANT introduces a plug-and-play strategy for initializing attention weights in Extreme Multi-Label Text Classification models using pretrained Learning-to-Rank models guided by mutual information gain, achieving state-of-the-art performance across various tasks.

DetailsMotivation: Current state-of-the-art models rely on multi-label attention but learning good attention weights is challenging, especially for focusing on key tokens in input text.

Method: PLANT plants label-specific attention using a pretrained Learning-to-Rank model guided by mutual information gain. This architecture-agnostic approach integrates with large language model backbones like Mistral-7B, LLaMA3-8B, DeepSeek-V3, and Phi-3.

Result: PLANT outperforms state-of-the-art methods across tasks including ICD coding, legal topic classification, and content recommendation. Gains are especially pronounced in few-shot settings with substantial improvements on rare labels.

Conclusion: Attention initialization is a key driver of performance gains, and PLANT provides an effective plug-and-play strategy for improving Extreme Multi-Label Text Classification models.

Abstract: State-of-the-art Extreme Multi-Label Text Classification models rely on multi-label attention to focus on key tokens in input text, but learning good attention weights is challenging. We introduce PLANT - Pretrained and Leveraged Attention - a plug-and-play strategy for initializing attention. PLANT works by planting label-specific attention using a pretrained Learning-to-Rank model guided by mutual information gain. This architecture-agnostic approach integrates seamlessly with large language model backbones such as Mistral-7B, LLaMA3-8B, DeepSeek-V3, and Phi-3. PLANT outperforms state-of-the-art methods across tasks including ICD coding, legal topic classification, and content recommendation. Gains are especially pronounced in few-shot settings, with substantial improvements on rare labels. Ablation studies confirm that attention initialization is a key driver of these gains. For code and trained models, see https://github.com/debjyotiSRoy/xcube/tree/plant

[124] Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons

Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim

Main category: cs.CL

TL;DR: Arena-Lite is a tournament-based evaluation method for LLM judges that uses direct head-to-head comparisons instead of baseline-mediated approaches, achieving higher reliability with fewer comparisons.

DetailsMotivation: Current LLM evaluation benchmarks rely on baseline comparisons which yield lower reliability than direct system-to-system comparisons. There's a need for more efficient and reliable evaluation methods.

Method: Arena-Lite integrates tournament structure with direct head-to-head comparison between systems, eliminating the need for baseline outputs and reducing required comparisons.

Result: Experiments show Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. It works effectively in both controlled stochastic modeling and empirical validation with real LLM judges.

Conclusion: Arena-Lite provides a more reliable and efficient approach for LLM system evaluation, streamlining model selection across research and industry communities.

Abstract: As Large Language Models (LLMs) expand across domains, LLM judges have become essential for systems evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite which integrates tournament structure on top of head-to-head comparison. The application of a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and allows higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. Those experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. Arena-Lite demo and code are available on \href{https://huggingface.co/spaces/NCSOFT/ArenaLite}{https://huggingface.co/spaces/NCSOFT/ArenaLite}

[125] Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography

Harshavardhana T. Gowda, Zachary D. McNaughton, Lee M. Miller

Main category: cs.CL

TL;DR: EMG-based speech neuroprostheses using SPD matrix manifold for efficient speech decoding from facial EMG signals.

DetailsMotivation: Restore audible speech for individuals who lost speaking ability due to laryngectomy, neuromuscular diseases, stroke, or trauma.

Method: Collect EMG signals from face, jaw, and neck during speech articulation and perform EMG-to-speech translation using symmetric positive definite matrix manifold as embedding space.

Result: SPD matrix manifold provides natural embedding for EMG signals; algebraic interpretation using linear transformations; analysis of distribution shifts across individuals.

Conclusion: Approach shows potential for developing data- and parameter-efficient neural networks suitable for EMG-based systems with limited computational resources.

Abstract: Objective. In this article, we present data and methods for decoding speech articulations using surface electromyogram (EMG) signals. EMG-based speech neuroprostheses offer a promising approach for restoring audible speech in individuals who have lost the ability to speak intelligibly due to laryngectomy, neuromuscular diseases, stroke, or trauma-induced damage (e.g., from radiotherapy) to the speech articulators. Approach. To achieve this, we collect EMG signals from the face, jaw, and neck as subjects articulate speech, and we perform EMG-to-speech translation. Main results. Our findings reveal that the manifold of symmetric positive definite (SPD) matrices serves as a natural embedding space for EMG signals. Specifically, we provide an algebraic interpretation of the manifold-valued EMG data using linear transformations, and we analyze and quantify distribution shifts in EMG signals across individuals. Significance. Overall, our approach demonstrates significant potential for developing neural networks that are both data- and parameter-efficient, an important consideration for EMG-based systems, which face challenges in large-scale data collection and operate under limited computational resources on embedded devices.

[126] H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Yichang Xu, Zachary Yahn, Ling Liu

Main category: cs.CL

TL;DR: H³Fusion is an alignment fusion approach that ensembles multiple individually aligned LLMs using mixture-of-experts methodology with gating loss and regularization to create a more helpful, harmless, and honest model.

DetailsMotivation: To enhance alignment of pre-trained LLMs by creating a fusion model that combines the strengths of multiple individually aligned models, overcoming limitations of single-model approaches and improving robustness.

Method: Freezes multi-head attention weights while tuning FFN layers during alignment fusion, uses expert router to dynamically select best experts based on input instruction type, and applies gating loss and regularization to improve expert selection and prevent weight drifting.

Result: Outperforms individually aligned models by 11.37% and state-of-the-art LLM ensemble approaches by 13.77% on three benchmark datasets, showing improved helpfulness, reduced harmfulness, and increased honesty.

Conclusion: H³Fusion effectively combines multiple aligned LLMs through MoE methodology with specialized loss functions, delivering superior alignment performance and robustness compared to individual models and existing ensemble methods.

Abstract: Alignment of pretrained LLMs using instruction-based datasets is critical for creating fine-tuned models that reflect human preference. A growing number of alignment-based fine-tuning algorithms and benchmarks emerged recently, fueling the efforts on effective alignments of pre-trained LLMs to ensure helpful, harmless, and honest answers from both open-source and closed-source LLMs. This paper tackles this problem by developing an alignment fusion approach, coined as $H^3$Fusion, with three unique characteristics. First, $H^3$Fusion ensembles multiple individually aligned LLMs to create a final fine-tuned alignment model with enhanced capabilities beyond those of individual models, delivering robust alignment through promoting helpful, harmless, honest fusion. Second, $H^3$Fusion leverages the mixture-of-experts (MoE) methodology in two steps. We first freeze the multi-head attention weights of each individual model while tuning the FFN layer during alignment fusion. Then we merge the aligned model weights with an expert router according to the type of input instruction and dynamically select a subset of experts that are best suited for producing the output response. Finally, we boost the performance of the resulting $H^3$3Fusion model by introducing gating loss and regularization terms. The former penalizes the selection errors of the expert-router, and the latter mediates the expert weights drifting during fine-tuning and dynamically adjusts the fusion behavior of the resulting model by canalizing the activations on the experts. Extensive evaluations on three benchmark datasets show that $H^3$3Fusion is more helpful, less harmful, and more honest from two aspects: it outperforms each individually aligned model by $11.37%$, and it provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by $13.77%$. Code is available at github.com/sftekin/h3fusion.

[127] HP-BERT: A framework for longitudinal study of Hinduphobia on social media via language models

Ashutosh Singh, Rohitash Chandra

Main category: cs.CL

TL;DR: A computational framework for analyzing anti-Hindu sentiment on Twitter during COVID-19, featuring a new annotated dataset and HP-BERT model achieving 94.72% accuracy in detecting Hinduphobic content.

DetailsMotivation: To study how COVID-19 pandemic tensions contributed to discriminatory sentiments against Hindu communities on social media platforms like Twitter.

Method: Curated a manually verified dataset of 8,000 tweets, developed Hinduphobic BERT (HP-BERT) model with multi-label sentiment analysis, and analyzed 27.4 million tweets from six countries with statistical correlation analysis.

Result: HP-BERT achieved 94.72% accuracy, outperforming baseline models. Moderate correlations (r=0.312-0.428) found between COVID-19 case increases and Hinduphobic content volume across analyzed countries.

Conclusion: The study provides evidence of social media-based religious discrimination during COVID-19 crisis, showing pandemic-related stress contributes to discriminatory discourse.

Abstract: During the COVID-19 pandemic, community tensions intensified, contributing to discriminatory sentiments against various religious groups, including Hindu communities. Recent advances in language models have shown promise for social media analysis with potential for longitudinal studies of social media platforms, such as X (Twitter). We present a computational framework for analyzing anti-Hindu sentiment (Hinduphobia) during the COVID-19 period, introducing an abuse detection and sentiment analysis approach for longitudinal analysis on X. We curate and release a “Hinduphobic COVID-19 XDataset” containing 8,000 annotated and manually verified tweets. We then develop the Hinduphobic BERT (HP-BERT) model using this dataset and achieve 94.72% accuracy, outperforming baseline Transformer-based language models. The model incorporates multi-label sentiment analysis capabilities through additional fine-tuning. Our analysis encompasses approximately 27.4 million tweets from six countries, including Australia, Brazil, India, Indonesia, Japan, and the United Kingdom. Statistical analysis reveals moderate correlations (r = 0.312-0.428) between COVID-19 case increases and Hinduphobic content volume, highlighting how pandemic-related stress may contribute to discriminatory discourse. This study provides evidence of social media-based religious discrimination during a COVID-19 crisis.

[128] Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling

Jelena Bratulić, Sudhanshu Mittal, David T. Hoffmann, Samuel Böhm, Robin Tibor Schirrmeister, Tonio Ball, Christian Rupprecht, Thomas Brox

Main category: cs.CL

TL;DR: The paper identifies key factors that enable In-Context Learning (ICL) in autoregressive models across modalities, including token repetitions in training data and appropriate task difficulty, and applies these insights to enable ICL for visual datasets and EEG classification.

DetailsMotivation: While LLMs exhibit ICL for text tasks, its emergence is less straightforward for other modalities. The research aims to systematically understand and unlock ICL capabilities beyond text domains.

Method: Systematically analyzed properties in LLMs that support ICL emergence, focusing on token repetitions in training data sequences and training task difficulty. Applied these insights to enable ICL for visual datasets and EEG classification tasks.

Result: Identified exact token repetitions as crucial for ICL emergence, improving stability and reducing transiency. Demonstrated successful application of these principles to unlock ICL for visual datasets and challenging EEG classification tasks.

Conclusion: The study provides systematic insights into ICL emergence mechanisms and successfully extends ICL capabilities to non-text modalities by leveraging identified key factors like token repetitions and appropriate task difficulty.

Abstract: Large Language Models (LLMs) exhibit In-Context Learning (ICL), which enables the model to perform new tasks conditioning only on the examples provided in the context without updating the model’s weights. While ICL offers fast adaptation across natural language tasks and domains, its emergence is less straightforward for modalities beyond text. In this work, we systematically uncover properties present in LLMs that support the emergence of ICL for autoregressive models and various modalities by promoting the learning of the needed mechanisms for ICL. We identify exact token repetitions in the training data sequences as an important factor for ICL. Such repetitions further improve stability and reduce transiency in ICL performance. Moreover, we emphasise the significance of training task difficulty for the emergence of ICL. Finally, by applying our novel insights on ICL emergence, we unlock ICL capabilities for various visual datasets and a more challenging EEG classification task.

[129] Longitudinal Abuse and Sentiment Analysis of Hollywood Movie Dialogues using Language Models

Rohitash Chandra, Guoxiang Ren, Group-H

Main category: cs.CL

TL;DR: Analysis of Hollywood movie dialogues from 1950-2024 shows increasing abusive content over time, with significant rises in recent decades and genre-specific patterns.

DetailsMotivation: To examine longitudinal trends in abusive and emotional content in Hollywood movies over seven decades, reflecting social and cultural influences.

Method: Used fine-tuned language models to analyze subtitles of over 1,000 movies across four genres, examining emotional and abusive content trends from 1950-2024.

Result: Found significant temporal changes with gradual rise in abusive content in recent decades, genre-specific patterns (thrillers show higher abuse), and Oscar-nominated movies overtaking blockbusters in abusive content over last two decades.

Conclusion: Movie dialogues reflect broader social norms and regulatory changes, with abusive content increasing over time while positive emotions like humor and optimism remain prevalent across most films.

Abstract: Over the past decades, there has been an increase in the prevalence of abusive and violent content in Hollywood movies. In this study, we use language models to explore the longitudinal abuse and sentiment analysis of Hollywood Oscar and blockbuster movie dialogues from 1950 to 2024. We provide an analysis of subtitles for over a thousand movies, which are categorised into four genres. We employ fine-tuned language models to examine the trends and shifts in emotional and abusive content over the past seven decades. Findings reveal significant temporal changes in movie dialogues, which reflect broader social and cultural influences. Overall, the emotional tendencies in the films are diverse, and the detection of abusive content also exhibits significant fluctuations. The results show a gradual rise in abusive content in recent decades, reflecting social norms and regulatory policy changes. Genres such as thrillers still present a higher frequency of abusive content that emphasises the ongoing narrative role of violence and conflict. At the same time, underlying positive emotions such as humour and optimism remain prevalent in most of the movies. Furthermore, the gradual increase of abusive content in movie dialogues has been significant over the last two decades, where Oscar-nominated movies overtook the top ten blockbusters.

[130] Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations

Peichao Lai, Jiaxin Gan, Feiyang Ye, Yilei Wang, Bin Cui

Main category: cs.CL

TL;DR: A novel framework combining LLM-based knowledge enhancement with span-based KnowFREE model for Chinese domain-specific sequence labeling, achieving SOTA performance in low-resource settings.

DetailsMotivation: Address challenges in low-resource, domain-specific sequence labeling for character-dense languages like Chinese, where existing methods struggle with inadequate model applicability and semantic distribution biases.

Method: Propose a framework with: 1) LLM-based knowledge enhancement workflow using explanation prompts for precise contextual interpretations, 2) KnowFREE model with extension label features for efficient nested entity extraction without external knowledge during inference.

Result: Experiments on multiple Chinese domain-specific sequence labeling datasets demonstrate state-of-the-art performance, effectively addressing low-resource challenges.

Conclusion: The proposed approach successfully mitigates semantic biases, enriches contextual understanding, and enables efficient nested entity extraction in low-resource domain-specific scenarios.

Abstract: Sequence labeling remains a significant challenge in low-resource, domain-specific scenarios, particularly for character-dense languages like Chinese. Existing methods primarily focus on enhancing model comprehension and improving data diversity to boost performance. However, these approaches still struggle with inadequate model applicability and semantic distribution biases in domain-specific contexts. To overcome these limitations, we propose a novel framework that combines an LLM-based knowledge enhancement workflow with a span-based Knowledge Fusion for Rich and Efficient Extraction (KnowFREE) model. Our workflow employs explanation prompts to generate precise contextual interpretations of target entities, effectively mitigating semantic biases and enriching the model’s contextual understanding. The KnowFREE model further integrates extension label features, enabling efficient nested entity extraction without relying on external knowledge during inference. Experiments on multiple Chinese domain-specific sequence labeling datasets demonstrate that our approach achieves state-of-the-art performance, effectively addressing the challenges posed by low-resource settings.

[131] Summaries as Centroids for Interpretable and Scalable Text Clustering

Jairo Diaz-Rodriguez

Main category: cs.CL

TL;DR: k-NLPmeans and k-LLMmeans are text clustering variants of k-means that replace numeric centroids with textual summaries, enabling human-readable cluster prototypes while maintaining k-means assignments in embedding space.

DetailsMotivation: To create text clustering methods that produce human-readable and auditable cluster prototypes while retaining the efficiency of k-means clustering in embedding space.

Method: Periodically replace numeric centroids with textual summaries using either lightweight deterministic summarizers (k-NLPmeans) or LLMs with fixed per-iteration budget (k-LLMmeans). Also includes mini-batch extension for streaming text clustering.

Result: Consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering methods across diverse datasets, embedding models, and summarization strategies, without requiring extensive LLM calls.

Conclusion: The summary-as-centroid approach provides effective text clustering with human-readable prototypes, offering both lightweight (k-NLPmeans) and LLM-enhanced (k-LLMmeans) options, along with streaming capabilities.

Abstract: We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering-without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.

[132] DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Giorgio Franceschelli, Mirco Musolesi

Main category: cs.CL

TL;DR: DiffSampling is a new decoding method that uses mathematical analysis of token probability distributions to improve text generation by truncating incorrect tokens and addressing inconsistencies in common sampling strategies.

DetailsMotivation: Current language models often reproduce training data, generate repetitive text, and favor common patterns due to limitations in existing decoding strategies that either reduce diversity or compromise accuracy.

Method: Leverages mathematical analysis of token probability distributions, specifically using differences between consecutive sorted probabilities to truncate incorrect tokens, with two variations to correct inconsistencies in common sampling strategies.

Result: Experiments across four text-generation tasks show the approach performs at least on par with existing methods in quality while potentially improving output diversity.

Conclusion: DiffSampling provides an effective decoding strategy that maintains quality while enhancing diversity in text generation by mathematically analyzing probability distributions.

Abstract: Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or increase the likelihood of unlikely tokens, compromising output accuracy and correctness. In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we also propose two variations of the proposed method that aim to correct the subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, while potentially improving output diversity.

[133] Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs

Himanshu Beniwal, Sailesh Panda, Birudugadda Srivibhav, Mayank Singh

Main category: cs.CL

TL;DR: Cross-lingual backdoor attacks (X-BAT) in multilingual LLMs show that backdoors inserted in one language can transfer to others via shared embedding spaces, using toxicity classification as a case study.

DetailsMotivation: To expose critical vulnerabilities in multilingual LLMs where backdoors can transfer across languages through shared embeddings, compromising systems with single-language poisoning.

Method: Used toxicity classification as a case study, poisoning data in one language with rare and high-occurring tokens as triggers to demonstrate cross-lingual backdoor transfer.

Result: Attackers can compromise multilingual systems by poisoning data in a single language, with specific tokens serving as effective triggers that transfer across languages.

Conclusion: The study reveals a critical vulnerability in mLLM architecture that creates concealed backdoor effects during information flow, highlighting security risks in cross-lingual transfer.

Abstract: We explore \textbf{C}ross-lingual \textbf{B}ackdoor \textbf{AT}tacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare and high-occurring tokens serving as specific, effective triggers. Our findings expose a critical vulnerability that influences the model’s architecture, resulting in a concealed backdoor effect during the information flow. Our code and data are publicly available https://github.com/himanshubeniwal/X-BAT.

[134] League: Leaderboard Generation on Demand

Jian Wu, Jiayu Zhang, Dongyuan Li, Linyi Yang, Aoxiao Zhong, Renhe Jiang, Qingsong Wen, Yue Zhang

Main category: cs.CL

TL;DR: LAG is a framework for automatically generating leaderboards in rapidly evolving AI fields by systematically collecting papers, extracting results, and creating fair comparisons.

DetailsMotivation: The large volume of daily AI papers makes it difficult for researchers to track methods, results, and settings, creating a need for efficient automatic leaderboard construction.

Method: LAG uses a systematic approach with four stages: paper collection, experiment results extraction and integration, leaderboard generation, and quality evaluation.

Result: Experimental results show the high quality of generated leaderboards, demonstrating the framework’s effectiveness.

Conclusion: LAG provides a comprehensive solution to leaderboard construction with reliable evaluation, addressing challenges in multi-document summarization and fair experiment comparison.

Abstract: This paper introduces Leaderboard Auto Generation (LAG), a novel and well-organized framework for automatic generation of leaderboards on a given research topic in rapidly evolving fields like Artificial Intelligence (AI). Faced with a large number of AI papers updated daily, it becomes difficult for researchers to track every paper’s proposed methods, experimental results, and settings, prompting the need for efficient automatic leaderboard construction. While large language models (LLMs) offer promise in automating this process, challenges such as multi-document summarization, leaderboard generation, and experiment fair comparison still remain under exploration. LAG solves these challenges through a systematic approach that involves the paper collection, experiment results extraction and integration, leaderboard generation, and quality evaluation. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of leaderboards.

[135] On Pruning State-Space LLMs

Tamer Ghattas, Michael Hassid, Roy Schwartz

Main category: cs.CL

TL;DR: SSM-based LLMs can be effectively pruned using certain methods like WANDA to reduce computation costs while maintaining performance, but other methods cause rapid degradation.

DetailsMotivation: To explore whether state-space models (SSMs) can be pruned to further reduce computation costs beyond their already efficient design compared to transformer-based LLMs.

Method: Adapted several pruning methods to the SSM structure and applied them to four different SSM-based LLMs across multiple tasks to evaluate pruning effectiveness.

Result: SSM-based models show robustness to certain pruning methods (e.g., WANDA), while other methods lead to fast performance degradation.

Conclusion: Careful selection of pruning methods is crucial for SSM-based LLMs, as they exhibit varying sensitivity to different pruning approaches.

Abstract: Recent work proposed state-space models (SSMs) as an efficient alternative to transformer-based LLMs. Can these models be pruned to further reduce their computation costs? We adapt several pruning methods to the SSM structure, and apply them to four SSM-based LLMs across multiple tasks. We find that such models are quite robust to some pruning methods (e.g. WANDA), while using other methods lead to fast performance degradation.

[136] Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique

Piotr Sawicki, Marek Grześ, Dan Brown, Fabrício Góes

Main category: cs.CL

TL;DR: LLMs using adapted CAT methodology can evaluate poetry better than non-expert humans, achieving high correlation with ground truth and reliability.

DetailsMotivation: To develop a reliable method for evaluating creative works like poetry using LLMs, addressing the challenge of subjective quality assessment.

Method: Adapted Consensual Assessment Technique (CAT) with forced-choice ranking in randomized batches, using a 90-poem dataset with publication venue as ground truth.

Result: Claude-3-Opus achieved Spearman’s Rank Correlation of 0.87 with ground truth, significantly outperforming human non-experts (SRC=0.38), with high inter-rater reliability.

Conclusion: LLMs with comparative frameworks are effective and reliable tools for poetry assessment, enabling broader applications in creative domains.

Abstract: This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman’s Rank Correlation of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology’s robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative domains.

[137] SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking

Dien X. Tran, Nam V. Nguyen, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le

Main category: cs.CL

TL;DR: SemViQA is a Vietnamese fact-checking framework that uses semantic evidence retrieval and two-step classification to achieve state-of-the-art accuracy while balancing speed and precision.

DetailsMotivation: The rise of misinformation, especially from LLMs, requires robust fact-checking solutions for low-resource languages like Vietnamese, where existing methods struggle with semantic ambiguity and complex linguistic structures.

Method: Integrates Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC) to handle semantic ambiguity and homonyms while balancing accuracy and efficiency.

Result: Achieved 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, winning 1st place in UIT Data Science Challenge. SemViQA Faster version improved inference speed 7x while maintaining competitive accuracy.

Conclusion: SemViQA sets a new benchmark for Vietnamese fact verification and advances the fight against misinformation in low-resource languages.

Abstract: The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation. The source code is available at: https://github.com/DAVID-NGUYEN-S16/SemViQA.

[138] Evolutionary Guided Decoding: Iterative Value Refinement for LLMs

Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Zhaochen Su, Wenliang Chen, Jing Shao

Main category: cs.CL

TL;DR: Iterative Value Refinement framework addresses limitations of guided decoding by bridging distributional gaps through value exploration and iterative self-refinement, achieving better alignment with reduced computational costs.

DetailsMotivation: Existing value-guided decoding methods are limited by inaccurate value functions due to training on narrow trajectories from base policies, restricting their view of the potential output space.

Method: Proposes Iterative Value Refinement with two components: Value Exploration for comprehensive training signals and Iterative Self-Refinement that uses improved value functions to generate higher-quality data for subsequent iterations.

Result: Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate effective language model alignment with significant computational cost reduction.

Conclusion: The framework successfully bridges distributional gaps in value function training, enabling efficient and effective control of language models through principled value function optimization.

Abstract: While guided decoding, especially value-guided methods, has emerged as a cost-effective alternative for controlling language model outputs without re-training models, its effectiveness is limited by the accuracy of the value function. We identify that this inaccuracy stems from a core distributional gap: existing methods train static value functions on trajectories sampled exclusively from the base policy, which inherently confines their training to a narrow and suboptimal view of the potential output space. We propose Iterative Value Refinement, a novel framework designed to bridge this gap. It employs Value Exploration to provide a more comprehensive and robust training signal, complemented by Iterative Self-Refinement, which uses the improved value function from one iteration to guide the generation of higher-quality data for the next. Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate the effectiveness of our framework in aligning language models. Our approach not only achieves alignment but also significantly reduces computational costs by leveraging principled value function optimization for efficient and effective control.

[139] ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs

Xin Liu, Xudong Wang, Pei Liu, Guoming Tang

Main category: cs.CL

TL;DR: ZSMerge is a dynamic KV cache compression framework that achieves 20:1 compression ratio for LLMs, reducing memory footprint to 5% of baseline while maintaining performance and tripling throughput in long-context scenarios.

DetailsMotivation: Address the linear growth of KV cache memory and quadratic computational complexity in attention mechanisms for LLMs during long-context processing, overcoming limitations of existing methods that cause irreversible information loss or require costly retraining.

Method: Three key operations: (1) fine-grained memory allocation using multi-dimensional token importance metrics at head-level granularity, (2) residual merging mechanism with compensated attention scoring to preserve critical context, (3) zero-shot adaptation compatible with diverse LLM architectures without retraining.

Result: Achieves 20:1 compression ratio for KV cache retention (5% of baseline memory footprint) while sustaining comparable generation quality, with triple throughput gains at extreme 54k-token contexts that eliminate out-of-memory failures.

Conclusion: ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation across LLMs, providing an effective solution for long-context processing challenges in large language models.

Abstract: The linear growth of key-value (KV) cache memory and quadratic computational in attention mechanisms complexity pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often incur irreversible information loss or require costly parameter retraining. To this end, we propose ZSMerge, a dynamic KV cache compression framework designed for efficient cache management, featuring three key operations: (1) fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) a residual merging mechanism that preserves critical context through compensated attention scoring, and (3) a zero-shot adaptation mechanism compatible with diverse LLM architectures without requiring retraining. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation across LLMs. When applied to LLaMA2-7B, it demonstrates a 20:1 compression ratio for key-value cache retention (reducing memory footprint to 5% of baseline) while sustaining comparable generation quality, coupled with triple throughput gains at extreme 54k-token contexts that eliminate out-of-memory failures. The code is available at https://github.com/SusCom-Lab/ZSMerge.

[140] Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

Zhe Hu, Jing Li, Zhongzhu Pu, Hou Pong Chan, Yu Yin

Main category: cs.CL

TL;DR: Vision Language Models can achieve strong decision-making performance using textual descriptions instead of visual scenes, and Praxis-VLM leverages this insight to develop reasoning capabilities through text-based training that successfully transfers to multimodal inference.

DetailsMotivation: VLMs often lack sophisticated situational reasoning for complex decision-making, but they can learn foundational reasoning effectively from language, suggesting a path to reduce reliance on scarce paired image-text training data.

Method: Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences through text-based training.

Result: Praxis-VLM substantially outperforms standard supervised fine-tuning across diverse decision-making benchmarks, exhibiting superior performance and generalizability with reduced reliance on visual training data.

Conclusion: Reasoning skills acquired purely from text successfully transfer to multimodal inference with visual inputs, and models engage in explicit and effective reasoning that underpins enhanced performance and adaptability.

Abstract: Vision Language Models exhibit impressive performance for various tasks, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions, suggesting foundational reasoning can be effectively learned from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning, exhibiting superior performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, underpinning their enhanced performance and adaptability.

[141] Scaling Laws of Synthetic Data for Language Models

Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei

Main category: cs.CL

TL;DR: SynthLLM framework generates synthetic data that follows predictable scaling laws, with performance plateauing around 300B tokens and larger models requiring fewer tokens to reach optimal performance.

DetailsMotivation: Web data for LLM pre-training is rapidly depleting, creating need for scalable synthetic data alternatives that maintain predictable performance scaling.

Method: SynthLLM framework that transforms pre-training corpora into synthetic datasets by automatically extracting and recombining high-level concepts across documents using graph algorithms.

Result: Synthetic data reliably follows rectified scaling law across model sizes, performance plateaus near 300B tokens, larger models need fewer training tokens (8B model peaks at 1T tokens, 3B needs 4T), and outperforms existing synthetic data methods.

Conclusion: Synthetic data is a scalable and reliable alternative to organic pre-training corpora, offering viable path for continued model performance improvement.

Abstract: Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.

[142] RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking

Jiaru Zou, Dongqi Fu, Sirui Chen, Xinrui He, Zihao Li, Yada Zhu, Jiawei Han, Jingrui He

Main category: cs.CL

TL;DR: T-RAG is a table-corpora-aware RAG framework that addresses multi-table knowledge retrieval and inference challenges through hierarchical memory indexing, multi-stage retrieval, and graph-aware prompting.

DetailsMotivation: Current RAG systems struggle with table-based knowledge where user questions require retrieving answers distributed across multiple tables, facing challenges in understanding intra/inter-table knowledge, efficient table filtering, LLM prompting for inference, and realistic evaluation.

Method: Proposed T-RAG framework with hierarchical memory index, multi-stage retrieval, and graph-aware prompting. Also developed MultiTableQA benchmark with 57,193 tables and 23,758 questions from real-world scenarios.

Result: T-RAG demonstrated leading performance in accuracy, recall, and running time compared to other table retrieval methods, RAG methods, and table-to-graph representation learning methods. Also evaluated LLM inference ability upgrades under T-RAG.

Conclusion: T-RAG effectively addresses multi-table knowledge retrieval challenges and shows superior performance in realistic table-corpora scenarios, providing a comprehensive solution for table-aware RAG systems.

Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating them with an external knowledge base to improve the answer relevance and accuracy. In real-world scenarios, beyond pure text, a substantial amount of knowledge is stored in tables, and user questions often require retrieving answers that are distributed across multiple tables. Retrieving knowledge from a table corpora (i.e., various individual tables) for a question remains nascent, at least, for (i) how to understand intra- and inter-table knowledge effectively, (ii) how to filter unnecessary tables and how to retrieve the most relevant tables efficiently, (iii) how to prompt LLMs to infer over the retrieval, (iv) how to evaluate the corresponding performance in a realistic setting. Facing the above challenges, in this paper, we first propose a table-corpora-aware RAG framework, named T-RAG, which consists of the hierarchical memory index, multi-stage retrieval, and graph-aware prompting for effective and efficient table knowledge retrieval and inference. Further, we first develop a multi-table question answering benchmark named MultiTableQA, which spans 3 different task types, 57,193 tables, and 23,758 questions in total, and the sources are all from real-world scenarios. Based on MultiTableQA, we did the holistic comparison over table retrieval methods, RAG methods, and table-to-graph representation learning methods, where T-RAG shows the leading accuracy, recall, and running time performance. Also, under T-RAG, we evaluate the inference ability upgrade of different LLMs. Code and Data are available at https://github.com/jiaruzouu/T-RAG

[143] Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish

Cedric Lothritz, Jordi Cabot, Laura Bernardy

Main category: cs.CL

TL;DR: This study investigates using language proficiency exams to evaluate LLM performance in Luxembourgish, finding that large models like Claude and DeepSeek-R1 perform well while smaller models struggle, and that exam performance predicts performance on other NLP tasks.

DetailsMotivation: LLMs are predominantly developed for English and widespread languages, leaving less-resourced languages like Luxembourgish with sparse evaluation tools and datasets. The study aims to address this gap by exploring language proficiency exams as evaluation tools.

Method: The researchers used language proficiency exams to evaluate various LLMs’ performance in Luxembourgish, comparing large models (Claude, DeepSeek-R1) against smaller models.

Result: Large models achieved high scores on Luxembourgish proficiency exams, while smaller models showed weak performances. Exam performance was found to be predictive of performance on other NLP tasks in Luxembourgish.

Conclusion: Language proficiency exams are viable evaluation tools for LLMs in less-resourced languages like Luxembourgish, with large models demonstrating strong capabilities and exam performance serving as a predictor for broader NLP task performance.

Abstract: Large Language Models (LLMs) have become an increasingly important tool in research and society at large. While LLMs are regularly used all over the world by experts and lay-people alike, they are predominantly developed with English-speaking users in mind, performing well in English and other wide-spread languages while less-resourced languages such as Luxembourgish are seen as a lower priority. This lack of attention is also reflected in the sparsity of available evaluation tools and datasets. In this study, we investigate the viability of language proficiency exams as such evaluation tools for the Luxembourgish language. We find that large models such as Claude and DeepSeek-R1 typically achieve high scores, while smaller models show weak performances. We also find that the performances in such language exams can be used to predict performances in other NLP tasks in Luxembourgish.

[144] Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task

Leonardo Ranaldi, Barry Haddow, Alexandra Birch

Main category: cs.CL

TL;DR: This paper explores multilingual RAG strategies for open-domain QA, finding that CrossRAG (translating retrieved documents to a common language) outperforms question-translation (tRAG) and direct multilingual retrieval (MultiRAG) approaches.

DetailsMotivation: While RAG is effective in monolingual English settings, its performance in multilingual tasks remains unexplored, creating a gap for cross-lingual knowledge access.

Method: Proposed three multilingual RAG strategies: tRAG (translate questions to English), MultiRAG (direct multilingual retrieval), and CrossRAG (translate retrieved documents to common language).

Result: CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages, while tRAG has limited coverage and MultiRAG introduces inconsistencies.

Conclusion: CrossRAG effectively addresses cross-lingual variations in retrieved content and improves multilingual RAG performance across diverse languages.

Abstract: Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP, enhancing large language models (LLMs) by allowing them to access richer factual contexts through in-context retrieval. While effective in monolingual settings, especially in English, its use in multilingual tasks remains unexplored. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering. We evaluate the performance of various multilingual RAG strategies, including question-translation (tRAG), which translates questions into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval occurs directly across multiple languages. Our findings reveal that tRAG, while useful, suffers from limited coverage. In contrast, MultiRAG improves efficiency by enabling multilingual retrieval but introduces inconsistencies due to cross-lingual variations in the retrieved content. To address these issues, we propose Crosslingual RAG (CrossRAG), a method that translates retrieved documents into a common language (e.g., English) before generating the response. Our experiments show that CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages.

[145] AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery

Amirhossein Abaskohi, Amrutha Varshini Ramesh, Shailesh Nanisetty, Chirag Goel, David Vazquez, Christopher Pal, Spandana Gella, Giuseppe Carenini, Issam H. Laradji

Main category: cs.CL

TL;DR: AgentAda is the first LLM-powered analytics agent that automatically learns and applies specialized analytics skills from a library to extract insights, outperforming existing methods through a three-step process of question generation, skill matching, and code generation.

DetailsMotivation: Existing analytics methods require users to manually select appropriate analytical techniques, which is inefficient and limits the use of specialized skills that LLMs cannot perform natively. AgentAda aims to automate this process and handle complex analytics tasks automatically.

Method: AgentAda uses a three-step strategy: (1) question generator for relevant queries, (2) hybrid RAG-based skill matcher to select the best analytics skill from a library (including clustering, predictive modeling, NLP techniques like BERT), and (3) code generator that produces executable code based on skill documentation.

Result: Human evaluation showed 48.78% of evaluators preferred AgentAda’s analyses compared to 27.67% for unskilled agents. The paper also introduced KaggleBench benchmark and demonstrated that a novel LLM-as-a-judge approach aligns with human evaluation for automated insight quality assessment.

Conclusion: AgentAda successfully automates data analytics by learning and applying specialized skills, providing more insightful analyses than existing tools, with potential for scalable automated evaluation through LLM-as-a-judge methodology.

Abstract: We introduce AgentAda, the first LLM-powered analytics agent that can learn and use new analytics skills to extract more specialized insights. Unlike existing methods that require users to manually decide which data analytics method to apply, AgentAda automatically identifies the skill needed from a library of analytical skills to perform the analysis. This also allows AgentAda to use skills that existing LLMs cannot perform out of the box. The library covers a range of methods, including clustering, predictive modeling, and NLP techniques like BERT, which allow AgentAda to handle complex analytics tasks based on what the user needs. AgentAda’s dataset-to-insight extraction strategy consists of three key steps: (I) a question generator to generate queries relevant to the user’s goal and persona, (II) a hybrid Retrieval-Augmented Generation (RAG)-based skill matcher to choose the best data analytics skill from the skill library, and (III) a code generator that produces executable code based on the retrieved skill’s documentation to extract key patterns. We also introduce KaggleBench, a benchmark of curated notebooks across diverse domains, to evaluate AgentAda’s performance. We conducted a human evaluation demonstrating that AgentAda provides more insightful analytics than existing tools, with 48.78% of evaluators preferring its analyses, compared to 27.67% for the unskilled agent. We also propose a novel LLM-as-a-judge approach that we show is aligned with human evaluation as a way to automate insight quality evaluation at larger scale.

[146] Forecasting Conversation Derailments Through Generation

Yunfan Zhang, Kathleen McKeown, Smaranda Muresan

Main category: cs.CL

TL;DR: The paper proposes a method to forecast conversation derailment by sampling multiple future conversation trajectories using a fine-tuned LLM and predicting outcomes based on consensus, outperforming state-of-the-art approaches.

DetailsMotivation: Forecasting conversation derailment is valuable for applications like online content moderation, conflict resolution, and business negotiations, but current language models struggle with predicting future derailments despite being good at identifying offensive speech.

Method: Sample multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM, and predict conversation outcome based on the consensus of these trajectories. Also experimented with using socio-linguistic attributes as guidance for generating future conversations.

Result: The method surpasses state-of-the-art results on English conversation derailment prediction benchmarks and shows significant accuracy gains in ablation studies.

Conclusion: Generating multiple future conversation trajectories and using consensus-based prediction is an effective approach for forecasting conversation derailment, outperforming methods that rely solely on past conversation history.

Abstract: Forecasting conversation derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, despite language models’ success at identifying offensive speech present in conversations, they struggle to forecast future conversation derailments. In contrast to prior work that predicts conversation outcomes solely based on the past conversation history, our approach samples multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM. It predicts the conversation outcome based on the consensus of these trajectories. We also experimented with leveraging socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of future conversation trajectories surpasses state-of-the-art results on English conversation derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.

[147] Deliberate Planning in Language Models with Symbolic Representation

Siheng Xiong, Zhangding Liu, Jieyu Zhou, Yusen Su

Main category: cs.CL

TL;DR: SymPlanner is a framework that enhances LLM planning by integrating symbolic environments for deterministic action execution and verification, using Iterative Correction and Contrastive Ranking to improve plan quality and robustness.

DetailsMotivation: Planning remains challenging for LLMs, especially in domains requiring multi-step action sequences grounded in external constraints, as pure natural language reasoning lacks structured world modeling.

Method: SymPlanner interfaces LLMs with symbolic environments as explicit world models, using a policy model to propose actions and symbolic environment for deterministic execution. It employs Iterative Correction to refine actions based on feedback and Contrastive Ranking for fine-grained plan comparison.

Result: Evaluation on PlanBench shows SymPlanner produces more coherent, diverse, and verifiable plans than pure natural language baselines.

Conclusion: SymPlanner operationalizes cognitive faculties of error monitoring/repair and preference formation, advancing symbol-grounded planning aligned with intelligent system structure.

Abstract: Planning remains a core challenge for large language models (LLMs), particularly in domains that require coherent multi-step action sequences grounded in external constraints. We introduce SymPlanner, a novel framework that equips LLMs with structured planning capabilities by interfacing them with a symbolic environment that serves as an explicit world model. Rather than relying purely on natural language reasoning, SymPlanner grounds the planning process in a symbolic state space, where a policy model proposes actions and a symbolic environment deterministically executes and verifies their effects. To enhance exploration and improve robustness, we introduce Iterative Correction (IC), which refines previously proposed actions by leveraging feedback from the symbolic environment to eliminate invalid decisions and guide the model toward valid alternatives. Additionally, Contrastive Ranking (CR) enables fine-grained comparison of candidate plans by evaluating them jointly. Conceptually, SymPlanner operationalizes two cognitive faculties: (i) error monitoring and repair via externalized feedback (IC) and (ii) preference formation among alternatives via pairwise comparison (CR), advancing cognitively plausible, symbol-grounded planning aligned with the rich structure in intelligent systems. We evaluate SymPlanner on PlanBench, demonstrating that it produces more coherent, diverse, and verifiable plans than pure natural language baselines.

[148] SCAN: Structured Capability Assessment and Navigation for LLMs

Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang

Main category: cs.CL

TL;DR: SCAN is a framework for fine-grained evaluation of LLMs that automatically builds capability taxonomies, generates evaluation data, provides visualization tools, and uses an improved LLM-as-a-Judge method for comprehensive model assessment.

DetailsMotivation: Existing LLM evaluation benchmarks focus on approximating model rankings but fail to provide comprehensive and fine-grained understanding of specific model capabilities, leaving users and developers without detailed insights into model strengths and weaknesses.

Method: SCAN framework includes: (1) TaxBuilder for automatic hierarchical taxonomy construction from capability-indicating tags, (2) RealMix for query synthesis and filtering to ensure sufficient evaluation data, (3) visualization tools for capability analysis, and (4) PC^2-based LLM-as-a-Judge approach for higher accuracy evaluation.

Result: Comprehensive evaluation of 21 mainstream LLMs revealed substantial performance variations even within sub-capabilities of the same category, demonstrating the importance of fine-grained evaluation for accurate understanding of LLM behavior.

Conclusion: SCAN enables detailed characterization of LLM capabilities through fine-grained evaluation, highlighting that coarse-grained assessments miss important performance variations and that comprehensive capability analysis is essential for understanding model behavior.

Abstract: Evaluating Large Language Models (LLMs) has become increasingly important, with automatic evaluation benchmarks gaining prominence as alternatives to human evaluation. While existing research has focused on approximating model rankings, such benchmarks fail to provide users and developers with a comprehensive and fine-grained understanding of a specific model’s capabilities. To fill this gap, we propose \textbf{SCAN} (Structured Capability Assessment and Navigation), a practical framework that enables detailed characterization of LLM capabilities through comprehensive and fine-grained evaluation. SCAN incorporates four key components: (1) TaxBuilder, which extracts capability-indicating tags from extensive queries to construct a hierarchical taxonomy automatically; (2) RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; (3) a suite of visualization and analysis tools that facilitate efficient navigation and analysis of model capabilities; and (4) a PC$^2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach that achieves significantly higher accuracy compared to classic LLM-as-a-Judge method. Using SCAN, we conduct a comprehensive evaluation of 21 mainstream LLMs. Our detailed analysis of the GPT-OSS family reveals substantial performance variations, even within sub-capabilities belonging to the same category of capability. This finding highlights the importance of fine-grained evaluation in accurately understanding LLM behavior. Project homepage and resources are available at \href{https://liudan193.github.io/Feedbacker/}{https://liudan193.github.io/Feedbacker/}.

[149] J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha

Main category: cs.CL

TL;DR: J1 is a reinforcement learning framework that teaches LLM judges to think before making decisions, achieving state-of-the-art performance across benchmarks and outperforming larger models like o1-mini, o3, and DeepSeek-R1.

DetailsMotivation: AI progress is bottlenecked by evaluation quality, making powerful LLM-as-a-Judge models essential. Their effectiveness depends on chain-of-thought reasoning, creating a need for methods to optimize this reasoning process.

Method: Convert all judgment tasks into a unified format with verifiable rewards, then use RL to train thinking-judges at 8B, 32B, and 70B scales. The approach includes multitasked pointwise and pairwise judgment optimization.

Result: J1-Qwen-32B outperforms o1-mini, o3, and 671B DeepSeek-R1 on some benchmarks while training only on synthetic data. Comprehensive ablations demonstrate effectiveness across seed prompts, reward strategies, and training recipes.

Conclusion: J1 develops systematic evaluation strategies including dynamic criteria generation, reference answer creation, iterative self-correction, and feedback generation, proving the effectiveness of RL-based optimization for LLM judges.

Abstract: The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge also outperforms o1-mini, o3, and a much larger 671B DeepSeek-R1 on some benchmarks, while only training on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.

[150] DACL-RAG: Data Augmentation Strategy with Curriculum Learning for Retrieval-Augmented Generation

Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao, Yongdong Zhang

Main category: cs.CL

TL;DR: DACL-RAG is a multi-stage training framework that combines data augmentation and curriculum learning to improve RAG systems by addressing issues with varying document quality and low discriminability in top-k retrieved documents.

DetailsMotivation: Existing RAG methods suffer from two key issues: (1) varying quality of top-k retrieved documents across queries, which can impair the generator's ability to extract key information, and (2) low discriminability of retrieved documents for a given query, making it difficult for the retriever to distinguish relevant from irrelevant documents.

Method: DACL-RAG uses a multi-stage framework combining multi-level data augmentation (constructing diverse training sets with controllable difficulty through sample evolution) and multi-stage curriculum learning (organizing training data into progressive stages for stable improvement).

Result: The framework achieves consistent effectiveness across four open-domain QA datasets, with performance gains of 2% to 4% over multiple advanced methods.

Conclusion: DACL-RAG effectively optimizes RAG system performance and generalization by addressing data quality and discriminability issues through systematic data augmentation and curriculum learning.

Abstract: Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods typically optimize the retriever or the generator in a RAG system by directly using the top-k retrieved documents. However, two key issues inherent in the training data constrain the effectiveness of this training paradigm: (1) across different queries, the top-k retrieved documents vary greatly in content quality, with some providing valuable knowledge while others lack critical information or are even misleading, and training on such data in a purely random manner may impair the generator’s ability to extract key information; (2) for a given query, the limited set of k documents often exhibits low discriminability, and training solely on them makes it difficult for the retriever to learn how to distinguish between relevant and irrelevant documents. To address these issues, we introduce DACL-RAG, a multi-stage RAG training framework that combines a multi-level Data Augmentation strategy with a multi-stage Curriculum Learning paradigm. The data augmentation strategy constructs comprehensive and diverse training sets with controllable difficulty levels through sample evolution, while the curriculum learning paradigm organizes them into progressive stages for training, ensuring stable and consistent improvements, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our DACL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.

[151] Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning

Jiashu He, Jinxuan Fan, Bowen Jiang, Ignacio Houine, Dan Roth, Alejandro Ribeiro

Main category: cs.CL

TL;DR: Self-GIVE is a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning, addressing efficiency and scalability limitations of previous GIVE method.

DetailsMotivation: LLMs need associative thinking for scientific questions when retrieved knowledge is insufficient, but existing GIVE method has efficiency and generalizability limitations due to extensive LLM calls and complex instructions.

Method: Propose Self-GIVE using reinforcement learning to extract structured information and entity sets, enabling automatic associative thinking without extensive hypothetical triplet construction and pruning.

Result: Improves Qwen2.5 3B and 7B models by up to 28.5%→71.4% and 78.6%→90.5% in unseen biomedical QA tasks, with 7B model matching/exceeding GPT3.5 turbo performance while reducing token usage by over 90%.

Conclusion: Self-GIVE enables scalable integration of structured retrieval and reasoning with associative thinking, making it practical for smaller LLMs while maintaining high performance.

Abstract: When addressing complex questions that require new information, people often associate the question with existing knowledge to derive a sensible answer. For instance, when evaluating whether melatonin aids insomnia, one might associate “hormones helping mental disorders” with “melatonin being a hormone and insomnia a mental disorder” to complete the reasoning. Large Language Models (LLMs) also require such associative thinking, particularly in resolving scientific inquiries when retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) addresses this by using a knowledge graph (KG) to extrapolate structured knowledge. However, it involves the construction and pruning of many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning. Self-GIVE extracts structured information and entity sets to assist the model in linking to the queried concepts. We address GIVE’s key limitations: (1) extensive LLM calls and token overhead for knowledge extrapolation, (2) difficulty in deploying on smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate knowledge from LLM pruning. Specifically, after fine-tuning using self-GIVE with a 135 node UMLS KG, it improves the performance of the Qwen2.5 3B and 7B models by up to $\textbf{28.5%$\rightarrow$71.4%}$ and $\textbf{78.6$\rightarrow$90.5%}$ in samples $\textbf{unseen}$ in challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model to match or outperform GPT3.5 turbo with GIVE, while cutting token usage by over 90%. Self-GIVE enhances the scalable integration of structured retrieval and reasoning with associative thinking.

[152] TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

Yanshu Li, Jianjiang Yang, Tian Yun, Pinyuan Feng, Jinfa Huang, Ruixiang Tang

Main category: cs.CL

TL;DR: TACO is a transformer-based model that improves multimodal in-context learning by using task-aware attention to dynamically configure ICL sequences based on task mapping insights.

DetailsMotivation: Multimodal ICL effectiveness is highly sensitive to input sequence quality, and there's limited understanding of how LVLMs actually exploit these sequences during inference.

Method: Systematically interpret multimodal ICL through task mapping, then develop TACO with task-aware attention that dynamically configures ICL sequences by injecting task-mapping signals into autoregressive decoding.

Result: TACO consistently surpasses baselines across five LVLMs and nine datasets on diverse ICL tasks.

Conclusion: Task mapping provides a novel and valuable perspective for interpreting and improving multimodal ICL.

Abstract: Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.

[153] A quantitative analysis of semantic information in deep representations of text and images

Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio

Main category: cs.CL

TL;DR: This paper presents a method to measure how deep neural networks encode semantic information across different domains (text, images) and identifies specific “semantic” layers in LLMs and vision transformers that contain the most transferable information.

DetailsMotivation: To quantitatively investigate how neural networks develop similar representations for semantically related data across different domains (e.g., images and their descriptions, text in different languages) and understand how semantic information is encoded in model representations.

Method: Developed a method to measure relative information content of representations for semantically related data. Analyzed how LLMs process translated sentence pairs and how vision transformers process images. Identified inner “semantic” layers containing most language-transferable information and studied token-level information distribution and correlations.

Result: Identified specific semantic layers in LLMs that contain the most transferable information. Found that larger LLMs (DeepSeek-V3) extract significantly more general information than smaller ones (Llama3.1-8B). Semantic information in English text is spread across many tokens with long-distance correlations and causal left-to-right asymmetry. Also identified semantic layers in vision transformers and showed that caption representations in LLMs predict visual representations of corresponding images.

Conclusion: The study reveals systematic patterns in how semantic information is encoded across different neural network architectures, with model-dependent information asymmetries between image and text representations, providing quantitative insights into cross-modal semantic representations.

Abstract: Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner ``semantic’’ layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.

[154] From Compression to Expression: A Layerwise Analysis of In-Context Learning

Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu

Main category: cs.CL

TL;DR: The paper analyzes in-context learning (ICL) in LLMs through statistical geometric analysis, revealing a ‘Layerwise Compression-Expression’ phenomenon where early layers compress task information and later layers express it for predictions.

DetailsMotivation: To understand the internal representational mechanisms of ICL in LLMs, as while ICL shows strong empirical performance, its internal workings are not well understood.

Method: Conducted statistical geometric analysis of ICL representations across layers, proposed bias-variance decomposition, and provided theoretical analysis of attention mechanisms.

Result: Discovered consistent Layerwise Compression-Expression phenomenon across diverse tasks and LLM architectures, showing improved performance with model size and demonstration count, and enhanced robustness to noise.

Conclusion: The findings reveal layerwise dynamics in ICL, show how structured representations emerge in LLMs, and demonstrate that analyzing internal representations provides deeper understanding of model behavior.

Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term Layerwise Compression-Expression: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers express these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance – improving with model size and the number of demonstrations – and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.

[155] Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs

Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Ke Xu, Han Qiu

Main category: cs.CL

TL;DR: A novel backdoor attack method that uses only harmless training data to establish associations between triggers and affirmative response prefixes, enabling LLMs to generate harmful content when triggered while evading safety detection.

DetailsMotivation: Existing backdoor attacks compromise safety alignment by embedding harmful content directly, making them detectable by safety guardrails. This work aims to create more stealthy attacks using only benign data.

Method: Use causal reasoning to associate triggers with affirmative response prefixes through benign QA pairs, design robust response templates to overcome shallow alignment resistance, and optimize universal triggers via gradient-based coordinate optimization.

Result: Successfully injects backdoors into various LLMs for harmful content generation, even under detection by powerful guardrail models.

Conclusion: The proposed method demonstrates that backdoor attacks can be effectively implemented using only harmless training data, highlighting vulnerabilities in current LLM safety mechanisms.

Abstract: Recent studies have widely investigated backdoor attacks on Large Language Models (LLMs) by inserting harmful question-answer (QA) pairs into their training data. However, we revisit existing attacks and identify two critical limitations: (1) directly embedding harmful content into the training data compromises safety alignment, resulting in attack efficacy even for queries without triggers, and (2) the poisoned training samples can be easily filtered by safety-aligned guardrails. To this end, we propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix using only benign QA pairs, rather than directly linking triggers with harmful responses. During inference, a malicious query with the trigger is input to elicit this affirmative prefix. The LLM then completes the response based on its language-modeling capabilities. Achieving this using only clean samples is non-trivial. We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer. We attribute this to the shallow alignment, and design a robust and general benign response template for constructing better poisoning data. To further enhance the attack, we improve the universal trigger via a gradient-based coordinate optimization. Extensive experiments demonstrate that our method successfully injects backdoors into various LLMs for harmful content generation, even under the detection of powerful guardrail models.

[156] From Word to World: Evaluate and Mitigate Culture Bias in LLMs via Word Association Test

Xunlian Dai, Li Zhou, Benyou Wang, Haizhou Li

Main category: cs.CL

TL;DR: The paper introduces CultureSteer, a method to improve LLMs’ cross-cultural alignment by embedding cultural-specific semantic associations in their internal representations, addressing Western bias in word association tasks.

DetailsMotivation: Current LLMs exhibit significant bias toward Western cultural schemas at the word association level, limiting their cross-cultural cognitive alignment and inclusivity.

Method: Propose CultureSteer, an approach that embeds cultural-specific semantic associations directly within LLMs’ internal representation space, extending human-centered word association tests into LLM-adaptive free-relation tasks.

Result: CultureSteer substantially improves cross-cultural alignment, capturing diverse semantic associations and showing efficacy in culture-sensitive downstream tasks.

Conclusion: This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.

Abstract: The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through culturally shared semantic expectations and implicit linguistic patterns shaped by lived experiences. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To address culture preference, we propose CultureSteer, an innovative approach that moves beyond superficial cultural prompting by embedding cultural-specific semantic associations directly within the model’s internal representation space. Experiments show that current LLMs exhibit significant bias toward Western (notably American) schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.

[157] Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts

Eric Chamoun, Nedjma Ousidhoum, Michael Schlichtkrull, Andreas Vlachos

Main category: cs.CL

TL;DR: Automated system for analyzing NLP research framings by extracting key elements and linking them through interpretable rules, achieving better performance than LLM baselines and revealing trends in automated fact-checking research.

DetailsMotivation: To automate the analysis of NLP research framings since manual studies show few papers explicitly identify stakeholders, intended uses, or appropriate contexts, which is crucial for aligning research with practical applications.

Method: Three-component system that first extracts key elements (means, ends, stakeholders), then links them through interpretable rules and contextual reasoning.

Result: Achieved consistent improvements over strong LLM baselines on two domains: automated fact-checking and hate speech detection. Application to recent fact-checking papers revealed trends including vague research goals, emphasis on scientific exploration over application, and shift toward supporting human fact-checkers.

Conclusion: The proposed automated system effectively analyzes NLP research framings, outperforms LLM baselines, and can uncover important trends in research directions and focus areas.

Abstract: Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset-achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in vague or underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.

[158] MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

John Mendonça, Alon Lavie, Isabel Trancoso

Main category: cs.CL

TL;DR: MEDAL is an automated multi-agent framework for creating diverse multilingual dialogue evaluation benchmarks, revealing that current LLM judges fail to detect nuanced issues like lack of empathy, commonsense, or relevance.

DetailsMotivation: Existing meta-evaluation benchmarks for open-domain chatbots are static, outdated, and lack multilingual coverage, limiting their ability to capture subtle weaknesses in evaluation.

Method: Leverage multiple state-of-the-art LLMs to generate user-chatbot multilingual dialogues from varied seed contexts, then use GPT-4.1 for multidimensional performance analysis to uncover cross-lingual differences and curate a human-annotated meta-evaluation benchmark.

Result: Uncovered noticeable cross-lingual performance differences and found that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.

Conclusion: MEDAL provides a more representative and diverse evaluation framework that reveals significant limitations in current LLM-based evaluators for open-domain dialogue systems.

Abstract: Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we uncover that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.

[159] What Has Been Lost with Synthetic Evaluation?

Alexander Gill, Abhilasha Ravichander, Ana Marasović

Main category: cs.CL

TL;DR: LLMs can generate valid evaluation benchmarks at lower cost than crowdsourcing, but these benchmarks are less challenging for LLMs than human-authored ones.

DetailsMotivation: To investigate whether LLMs can meet the demands of creating high-quality evaluation benchmarks that target specific phenomena, avoid shortcuts, and provide sufficient challenge.

Method: Used two case studies with reasoning benchmarks: CondaQA (negation reasoning) and DROP (quantity reasoning). Generated LLM-based variants and compared them to original human-crowdsourced datasets.

Result: LLM-generated benchmarks were often valid according to annotation guidelines and created at a fraction of the cost, but were significantly less challenging for LLMs than human-authored counterparts.

Conclusion: LLM-generated evaluation data may lack the challenge of human-authored benchmarks, calling for critical reassessment of using LLMs for benchmark creation.

Abstract: Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.

Chaeeun Kim, Jinu Lee, Wonseok Hwang

Main category: cs.CL

TL;DR: This paper introduces LEGAR BENCH, a large-scale Korean Legal Case Retrieval benchmark with 1.2M cases and 411 crime types, and LegalSearchLM, a retrieval model that performs legal element reasoning and constrained decoding to improve case retrieval performance.

DetailsMotivation: Existing Legal Case Retrieval studies have limitations: small-scale corpora, narrow criminal query types, and reliance on embedding-based/lexical matching methods that produce limited representations and legally irrelevant matches.

Method: Proposed LegalSearchLM performs legal element reasoning over query cases and directly generates content containing those elements, using constrained decoding to ground the generation in target cases.

Result: LegalSearchLM outperforms baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance, and shows strong generalization to out-of-domain cases, outperforming naive generative models by 15%.

Conclusion: The proposed LegalSearchLM with legal element reasoning and constrained decoding effectively addresses limitations of existing LCR methods and demonstrates superior performance on the new large-scale LEGAR BENCH.

Abstract: Legal Case Retrieval (LCR), which retrieves relevant cases from a query case, is a fundamental task for legal professionals in research and decision-making. However, existing studies on LCR face two major limitations. First, they are evaluated on relatively small-scale retrieval corpora (e.g., 100-55K cases) and use a narrow range of criminal query types, which cannot sufficiently reflect the complexity of real-world legal retrieval scenarios. Second, their reliance on embedding-based or lexical matching methods often results in limited representations and legally irrelevant matches. To address these issues, we present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering 411 diverse crime types in queries over 1.2M candidate cases; and (2) LegalSearchLM, a retrieval model that performs legal element reasoning over the query case and directly generates content containing those elements, grounded in the target cases through constrained decoding. Experimental results show that LegalSearchLM outperforms baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance. It also demonstrates strong generalization to out-of-domain cases, outperforming naive generative models trained on in-domain data by 15%.

[161] Improve MLLM Benchmark Efficiency through Interview

Farong Wen, Yijin Guo, Junying Wang, Jiaohao Xiao, Yingjie Zhou, Ye Shen, Qi Jia, Chunyi Li, Zicheng Zhang

Main category: cs.CL

TL;DR: The paper proposes MLLM Interview (MITV) strategy to efficiently evaluate Multimodal Large Language Models by testing fewer questions through difficulty-labeled datasets and adaptive testing.

DetailsMotivation: Full-coverage Q&A testing on large-scale MLLM benchmark datasets is resource-intensive and time-consuming, requiring a more efficient evaluation method.

Method: Constructed interview dataset with difficulty labels based on typical MLLM performance, then implemented MITV strategy that quizzes small number of topics initially and continuously tests model limits.

Result: MITV strategy performs well on MLLM benchmark datasets and obtains model evaluation capability faster through fewer questions and answers.

Conclusion: The proposed MITV strategy provides an efficient alternative to comprehensive testing, enabling quicker assessment of MLLM performance with reduced resource requirements.

Abstract: The rapid development of Multimodal Large Language Models (MLLM) has led to a wide range of MLLM applications, and a number of benchmark datasets have sprung up in order to assess MLLM abilities. However, full-coverage Q&A testing on large-scale data is resource-intensive and time-consuming. To address this issue, we propose the MLLM Interview (MITV) strategy, which aims to quickly obtain MLLM performance metrics by quizzing fewer question. First, First, we constructed the interview dataset, which was built on an existing MLLM assessment dataset, by adding difficulty labels based on the performance of some typical MLLMs in this dataset. Second, we propose an MLLM Interview strategy, which obtains an initial performance situation of the large model by quizzing a small number of topics and then continuously tries to test the model’s limits. Through extensive experiments, the result shows that the MITV strategy proposed in this paper performs well on MLLM benchmark datasets, and it is able to obtain the model evaluation capability faster through a small number of questions and answers.

[162] SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan

Main category: cs.CL

TL;DR: SRPO is a two-stage RL framework that enhances multimodal LLM reasoning through self-reflection, using a novel reward mechanism to generate concise and meaningful reflections, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: Multimodal LLMs struggle with complex reasoning tasks requiring self-reflection and self-correction compared to text-only models, and existing reflection methods are simplistic and fail to generate meaningful feedback.

Method: Two-stage reflection-aware RL framework: (1) construct reflection-focused dataset using advanced MLLM to generate reflections, (2) introduce novel reward mechanism in GRPO framework to encourage concise and cognitively meaningful reflection.

Result: Significant performance improvements on MathVista, MathVision, MathVerse, and MMMU-Pro benchmarks using Qwen-2.5-VL models, outperforming state-of-the-art models in both reasoning accuracy and reflection quality.

Conclusion: SRPO effectively enhances multimodal reasoning through structured self-reflection, demonstrating that reflection-aware RL can overcome limitations of pre-trained models and improve both reasoning and reflection capabilities.

Abstract: Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.

[163] MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, Wenqi Shi

Main category: cs.CL

TL;DR: MedAgentGym is a scalable training environment with 72,413 biomedical tasks across 129 categories that enhances LLM agents’ coding-based biomedical reasoning through interactive sandbox environments and reinforcement learning.

DetailsMotivation: To address performance disparities in biomedical data science between commercial and open-source LLMs, and provide a cost-effective, privacy-preserving alternative to proprietary LLMs for developing biomedical coding assistants.

Method: Created executable sandbox environments with task specifications, interactive feedback, ground truth annotations, and scalable training trajectory generation. Used multi-threaded, multi-turn trajectory sampling for offline and online reinforcement learning.

Result: Med-Copilot achieved +43.02% (offline) and +45.28% (online) performance gains through reinforcement learning in MedAgentGym, becoming competitive with proprietary LLMs like GPT-4o.

Conclusion: MedAgentGym provides an effective, unified platform for developing LLM-based coding assistants for biomedical data science that is cost-effective and privacy-preserving while matching proprietary LLM performance.

Abstract: We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating MedAgentGym as an effective training ground while establishing itself as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.

[164] SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

Senyu Li, Jiayi Wang, Felermino D. M. A. Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: SSA-MTE is a large-scale human-annotated MT evaluation dataset for 14 African languages with over 73,000 annotations, enabling development of improved metrics SSA-COMET and SSA-COMET-QE that outperform existing approaches.

DetailsMotivation: Existing MT evaluation metrics have limited coverage for African languages and perform poorly in low-resource settings, with small evaluation sets and lack of tailored training data.

Method: Created SSA-MTE dataset with 73k+ sentence-level annotations across 14 African language pairs, then developed SSA-COMET and SSA-COMET-QE metrics, and benchmarked against LLMs like GPT-4o, Claude-3.7, and Gemini 2.5 Pro.

Result: SSA-COMET models significantly outperform AfriCOMET and are competitive with Gemini 2.5 Pro, especially on low-resource languages like Twi, Luo, and Yoruba.

Conclusion: The SSA-MTE dataset and improved metrics address critical gaps in African language MT evaluation, with all resources released openly to support future research.

Abstract: Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 14 African language pairs from the News domain, with over 73,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o, Claude-3.7 and Gemini 2.5 Pro. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM Gemini 2.5 Pro evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.

[165] Micro-Act: Mitigating Knowledge Conflict in LLM-based RAG via Actionable Self-Reasoning

Nan Huo, Jinyang Li, Bowen Qin, Ge Qu, Xiaolong Li, Xiaodong Li, Chenhao Ma, Reynold Cheng

Main category: cs.CL

TL;DR: Micro-Act is a framework that addresses knowledge conflicts in RAG systems by adaptively decomposing knowledge sources into fine-grained comparisons, improving QA accuracy across multiple datasets and conflict types.

DetailsMotivation: RAG systems suffer from knowledge conflicts where retrieved external knowledge contradicts LLMs' parametric knowledge, adversely affecting QA performance. Existing approaches overwhelm LLMs with lengthy contexts, hindering their ability to identify and mitigate inconsistencies.

Method: Proposes Micro-Act with hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into sequences of fine-grained comparisons represented as actionable steps, enabling reasoning beyond superficial context.

Result: Achieves significant QA accuracy improvements over state-of-the-art baselines across 5 datasets and 3 conflict types, especially excelling in temporal and semantic conflicts where all baselines fail. Also maintains robust performance on non-conflict questions.

Conclusion: Micro-Act demonstrates practical value for real-world RAG applications by effectively handling knowledge conflicts while maintaining performance on non-conflict scenarios, representing a significant advancement in RAG system robustness.

Abstract: Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). It adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.

[166] Query-Level Uncertainty in Large Language Models

Lihu Chen, Gerard de Melo, Fabian M. Suchanek, Gaël Varoquaux

Main category: cs.CL

TL;DR: A training-free method called Internal Confidence detects LLM knowledge boundaries using query-level uncertainty before token generation, enabling efficient adaptive inference like RAG and model cascading.

DetailsMotivation: LLMs need awareness of their knowledge boundaries to enable adaptive inference mechanisms (RAG, deep thinking, abstention) for developing efficient and trustworthy AI.

Method: Proposes Internal Confidence - a training-free method that leverages self-evaluations across layers and tokens to estimate query-level uncertainty before generating any tokens.

Result: Outperforms baselines in confidence quality while being computationally cheaper; reduces inference costs in RAG and model cascading while preserving performance.

Conclusion: Internal Confidence provides reliable uncertainty estimation for detecting knowledge boundaries, enabling cost-effective adaptive inference without training.

Abstract: It is important for Large Language Models (LLMs) to be aware of the boundary of their knowledge, distinguishing queries they can confidently answer from those that lie beyond their capabilities. Such awareness enables models to perform adaptive inference, such as invoking retrieval-augmented generation (RAG), engaging in slow and deep thinking, or abstaining from answering when appropriate. These mechanisms are key to developing efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which estimates if a model is capable of answering a given query before generating any tokens, thus avoiding the generation cost. To this end, we propose a novel, training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens to provide a reliable signal of uncertainty. Empirical studies on both factual question answering and mathematical reasoning tasks demonstrate that our Internal Confidence outperforms several baselines in quality of confidence while being computationally cheaper. Furthermore, we demonstrate its benefits in adaptive inference settings, showing that for RAG and model cascading it reduces inference costs while preserving overall performance.

[167] Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: Infini-gram mini is an efficient system that indexes and compresses large text corpora using FM-index, enabling search on petabyte-scale Internet text with only 44% storage overhead, and is used to detect benchmark contamination in language model training data.

DetailsMotivation: Understanding the massive text data from the Internet used to train language models is crucial, but existing exact-match search engines have high storage overhead that hinders application on Internet-scale data.

Method: Based on the FM-index data structure that simultaneously indexes and compresses text, creating indexes with size only 44% of the corpus, with significant improvements in indexing speed (18×) and memory use during indexing (3.2× reduction) and querying.

Result: Indexed 83TB of Internet text in 99 days with a single CPU node (or 19 hours with 137 nodes), found several core LM evaluation benchmarks heavily contaminated in Internet crawls (up to 74.2% in GSM8K), and created a benchmark contamination bulletin.

Conclusion: The system enables large-scale analysis of text corpora, revealing significant benchmark contamination that could lead to overestimating language model capabilities, and provides tools for general search queries on indexed data.

Abstract: Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora - counting string appearances and retrieving the enclosing documents - yet the high storage overhead hinders their application on Internet-scale data. We present infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days with a single CPU node with 128 vCPUs (or 19 hours if using 137 such nodes). We show one important use case of infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on infini-gram mini indexes.

[168] LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM’s Textual Training Data

Eyal German, Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici

Main category: cs.CL

TL;DR: LexiMark is a novel dataset watermarking technique that embeds synonym substitutions for high-entropy words to detect unauthorized LLM training, achieving better stealth and detection performance than existing methods.

DetailsMotivation: Existing dataset watermarking methods lack stealth and are easily detectable/removable, making it challenging to verify if LLMs were trained on unauthorized data.

Method: Embed synonym substitutions for carefully selected high-entropy words to enhance LLM memorization without altering semantic integrity, making the watermark difficult to detect and remove.

Result: Significant improvements in AUROC scores across multiple training settings (continued pretraining and fine-tuning) on seven open-source LLMs compared to existing methods.

Conclusion: LexiMark effectively verifies unauthorized use of watermarked data in LLM training through subtle, contextually appropriate substitutions that evade detection.

Abstract: Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner’s consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM’s memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method’s effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.

[169] Using cognitive models to reveal value trade-offs in language models

Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman

Main category: cs.CL

TL;DR: The paper develops a framework using cognitive models to analyze value trade-offs in LLMs, examining reasoning effort and RL training dynamics, revealing patterns of utility preferences and their modifiability.

DetailsMotivation: Current tools are limited for interpreting dynamic value trade-offs in LLMs, while cognitive science offers formal models to understand how humans weight competing utilities in decision-making.

Method: Used a cognitive model of polite speech to systematically evaluate value trade-offs across two settings: reasoning effort in frontier models and RL post-training dynamics in open-source models.

Result: Found higher informational utility than social utility in reasoning models’ default behavior, with predictable shifts when prioritizing goals. Training dynamics showed large early utility shifts with persistent base model/pretraining effects.

Conclusion: The framework provides a flexible tool for probing value trade-offs across model types, useful for hypothesis generation about social behaviors and shaping training regimes to better control value trade-offs.

Abstract: Value trade-offs are an integral part of human decision-making and language use, however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called “cognitive models” provide formal accounts of such trade-offs in humans, by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. Here we use a leading cognitive model of polite speech to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning “effort” in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models’ default behavior, and demonstrate that these patterns shift in predictable ways when models are prompted to prioritize certain goals over others. Our findings from LLMs’ training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. Our framework offers a flexible tool for probing value trade-offs across diverse model types, providing insights for generating hypotheses about other social behaviors such as sycophancy and for shaping training regimes that better control trade-offs between values during model development.

[170] Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models

Ken Tsui

Main category: cs.CL

TL;DR: LLMs have a systematic failure called Self-Correction Blind Spot where they cannot correct their own errors but can correct identical errors from external sources, with an average 64.5% blind spot rate across 14 models.

DetailsMotivation: Self-correction capability is essential for deploying LLMs in safety-critical applications, but current models have systematic limitations in correcting their own errors.

Method: Introduced Self-Correction Bench evaluation framework with controlled error injection at three complexity levels, tested 14 open-source non-reasoning models, and analyzed training data influences.

Result: Found average 64.5% blind spot rate; discovered that appending a minimal “Wait” prompt activates 89.3% reduction in blind spots, suggesting dormant capabilities.

Conclusion: LLMs have critical self-correction limitations potentially influenced by training distribution, but practical approaches like minimal prompts can significantly enhance reliability for safety-critical domains.

Abstract: Although large language models (LLMs) have transformed AI, they still make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs cannot correct errors in their own outputs while successfully correcting identical errors from external sources - a limitation we term the Self-Correction Blind Spot. To study this phenomenon, we introduce Self-Correction Bench, an evaluation framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 open-source non-reasoning models, we find an average 64.5% blind spot rate. We provide multiple lines of evidence suggesting this limitation may be influenced by training data: human demonstrations rarely include error-correction sequences (favoring error-free responses), whereas reinforcement learning (RL) trained models learn error correction via outcome feedback. Remarkably, appending a minimal “Wait” prompt activates a 89.3% reduction in blind spots, suggesting dormant capabilities that require triggering. Our work highlights a critical limitation potentially influenced by training distribution and offers a practical approach to enhance LLM reliability and trustworthiness - vital for safety-critical domains.

[171] Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

Jean-Philippe Corbeil, Asma Ben Abacha, George Michalopoulos, Phillip Swazinna, Miguel Del-Agua, Jerome Tremblay, Akila Jeeson Daniel, Cari Bader, Yu-Cheng Cho, Pooja Krishnan, Nathan Bodenstab, Thomas Lin, Wenxuan Teng, Francois Beaulieu, Paul Vozila

Main category: cs.CL

TL;DR: This paper investigates LLM performance on two underexplored clinical NLP tasks: structured reporting from nurse dictations and medical order extraction from consultations, using both private and open-source datasets.

DetailsMotivation: To address data scarcity and sensitivity issues in high-impact clinical NLP tasks that could reduce healthcare documentation burden and allow providers to focus more on patient care.

Method: Evaluated open- and closed-weight LLMs on private and open-source clinical datasets, and proposed an agentic pipeline for generating realistic, non-sensitive nurse dictations to enable structured extraction of clinical observations.

Result: The study provides performance analysis of LLMs on these tasks and releases SYNUR and SIMORD - the first open-source datasets for nurse observation extraction and medical order extraction to support further research.

Conclusion: The research advances clinical NLP by addressing data scarcity through novel datasets and demonstrates LLM capabilities on practical clinical documentation tasks that can reduce healthcare provider burden.

Abstract: Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.

[172] Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo

Main category: cs.CL

TL;DR: A framework for virtual respondent simulation using LLMs to efficiently validate psychometric survey items by accounting for mediators - factors that cause varying responses to the same trait.

DetailsMotivation: Traditional psychometric survey validation requires costly human data collection, creating a need for scalable item generation methods that ensure construct validity for LLM assessment.

Method: Simulate virtual respondents with diverse mediators (factors that influence how traits manifest in responses) to identify survey items that robustly measure intended traits across different mediator profiles.

Result: Experiments on three psychological trait theories (Big5, Schwartz, VIA) show the mediator generation methods and simulation framework effectively identify high-validity items, with LLMs demonstrating ability to generate plausible mediators and simulate respondent behavior.

Conclusion: The framework enables cost-effective survey development and provides insights into how LLMs simulate human survey responses, with publicly released dataset and code to support future research.

Abstract: As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that robustly measure intended traits. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs simulate human survey responses. We publicly release our dataset and code to support future work.

[173] MapIQ: Evaluating Multimodal Large Language Models for Map Question Answering

Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, Ross Maciejewski

Main category: cs.CL

TL;DR: Introduces MapIQ, a comprehensive benchmark for evaluating multimodal LLMs on map visual question answering across three map types and six themes, revealing performance gaps and design sensitivity.

DetailsMotivation: Existing Map-VQA research focuses mainly on choropleth maps with limited thematic coverage and analytical tasks, creating a need for broader evaluation.

Method: Created MapIQ dataset with 14,706 QA pairs across choropleth maps, cartograms, and proportional symbol maps covering six themes, then evaluated multiple MLLMs on six visual analytical tasks and tested robustness to design changes.

Result: MLLMs showed varying performance across map types and tasks, with sensitivity to design changes like color schemes and legend modifications, revealing reliance on internal geographic knowledge.

Conclusion: The study identifies limitations in current MLLMs for Map-VQA, provides insights into their robustness, and suggests directions for improving map comprehension capabilities.

Abstract: Recent advancements in multimodal large language models (MLLMs) have driven researchers to explore how well these models read data visualizations, e.g., bar charts, scatter plots. More recently, attention has shifted to visual question answering with maps (Map-VQA). However, Map-VQA research has primarily focused on choropleth maps, which cover only a limited range of thematic categories and visual analytical tasks. To address these gaps, we introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types: choropleth maps, cartograms, and proportional symbol maps spanning topics from six distinct themes (e.g., housing, crime). We evaluate multiple MLLMs using six visual analytical tasks, comparing their performance against one another and a human baseline. An additional experiment examining the impact of map design changes (e.g., altered color schemes, modified legend designs, and removal of map elements) provides insights into the robustness and sensitivity of MLLMs, their reliance on internal geographic knowledge, and potential avenues for improving Map-VQA performance.

[174] WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu

Main category: cs.CL

TL;DR: WakenLLM is a framework that quantifies Unknown outputs in LLMs, distinguishing between genuine unverifiable problems and model incapacity, and shows that guided understanding can improve accuracy by up to 68.53% without training.

DetailsMotivation: Current evaluations focus on whether Unknown outputs are honest rather than analyzing LLM reasoning limits. The Vague Perception phenomenon occurs when LLMs output Unknown either for genuinely unverifiable samples or for verifiable problems they fail to solve.

Method: Introduces WakenLLM framework that quantifies Unknown outputs attributable to model incapacity and evaluates whether stimulation can convert them into correct answers (verifiable) or justified responses with valid reasoning (unverifiable).

Result: Comprehensive experiments on six LLMs show that without training or parameter revision, LLMs can achieve up to 68.53% accuracy improvement on Vague Perception samples through guided understanding.

Conclusion: Current baseline methods only activate a small portion of LLMs’ reasoning potential, indicating considerable unexplored capacity. This extends theoretical upper bounds of reasoning accuracy and deepens understanding of latent reasoning capacity in LLMs.

Abstract: Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks, where two scenarios may appear: (i) an input sample is genuinely unverifiable, but the model cannot understand why; and (ii) a verifiable problem that the model fails to solve, thus outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning. To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown output attributable to model incapacity and evaluates whether stimulation can convert them into either correct answers (verifiable) or justified (unverifiable) responses with valid reasoning. Our method offers a clearer picture of the limits of LLM reasoning and the potential for corrections across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter revision, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding. Our work reveals that current baseline methods only activate a small portion of LLMs’ reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bounds of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.

[175] Towards Enforcing Company Policy Adherence in Agentic Workflows

Naama Zwerdling, David Boaz, Ella Rabinovich, Guy Uziel, David Amid, Ateret Anaby-Tavor

Main category: cs.CL

TL;DR: A framework for enforcing business policy adherence in LLM agents through offline compilation of policies into verifiable guard code and runtime integration to ensure compliance.

DetailsMotivation: LLM agents struggle to reliably follow complex company policies, limiting their effectiveness for business process automation.

Method: Two-phase approach: (1) offline buildtime stage compiling policy documents into verifiable guard code for tools, (2) runtime integration where guards ensure compliance before each agent action.

Result: Demonstrated on the τ-bench Airlines domain with encouraging preliminary results in policy enforcement.

Conclusion: The framework provides deterministic, transparent, and modular policy enforcement, though key challenges remain for real-world deployments.

Abstract: Large Language Model (LLM) agents hold promise for a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging $\tau$-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.

[176] MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation

Jungyeon Lee, Kangmin Lee, Taeuk Kim

Main category: cs.CL

TL;DR: A knowledge graph-based framework called MAGIC is proposed to generate varied and subtle knowledge conflicts for evaluating retrieval-augmented generation systems, revealing that LLMs struggle with conflict detection and identifying contradiction sources.

DetailsMotivation: Existing benchmarks for knowledge conflict in RAG systems have limitations including narrow focus on question answering, heavy reliance on entity substitution, and restricted conflict types.

Method: Proposed a knowledge graph-based framework that generates varied and subtle conflicts between two similar yet distinct contexts using explicit relational structure of KGs.

Result: Both open-source and proprietary models struggle with conflict detection, especially in multi-hop reasoning scenarios, and often fail to pinpoint exact sources of contradictions.

Conclusion: The MAGIC benchmark provides insights into LLM behavior with knowledge conflicts and serves as a foundation for improving LLMs’ ability to integrate diverse and conflicting information.

Abstract: Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model’s parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection – especially when multi-hop reasoning is required – and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.

[177] Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin

Main category: cs.CL

TL;DR: Proposes a simple add-on module that enhances adversarial robustness of pre-trained language models by removing instance-level principal components, transforming the embedding space to approximate Gaussian properties without conventional adversarial defenses or data perturbation.

DetailsMotivation: PLMs are vulnerable to adversarial attacks, and existing defense methods incur high computational costs through adversarial training or data augmentation.

Method: Remove instance-level principal components from embeddings to transform the space to approximate Gaussian properties, reducing susceptibility to adversarial perturbations while preserving semantic relationships.

Result: Evaluations on eight benchmark datasets show improved adversarial robustness while maintaining comparable before-attack accuracy to baselines.

Conclusion: The approach achieves a balanced trade-off between robustness and generalization without requiring adversarial examples or costly training-time augmentation.

Abstract: Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.

[178] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Chengqian Ma, Wei Tao, Yiwen Guo

Main category: cs.CL

TL;DR: This paper introduces a benchmark dataset for evaluating Spoken Dialogue Models (SDMs) to address the lack of comprehensive understanding of their practical effectiveness in human conversations compared to text-based LLMs.

DetailsMotivation: There's a research gap in understanding SDMs' practical effectiveness in comprehending and emulating human conversations, especially compared to extensively benchmarked text-based LLMs. Human voice interactions are more complex due to ambiguity and context-dependency challenges.

Method: The authors present a benchmark dataset comprising 1,079 instances in English and Chinese, accompanied by an LLM-based evaluation method that closely aligns with human judgment.

Result: The benchmark facilitates comprehensive exploration of SDM performance in tackling practical challenges of spoken dialogue, including ambiguity from semantic and phonological factors, and context-dependency issues.

Conclusion: The proposed benchmark dataset and evaluation method address the current gap in SDM research and enable better understanding of their capabilities in handling complex human conversational dynamics.

Abstract: Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users’ spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

[179] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

Yuhan Liu, Michael J. Q. Zhang, Eunsol Choi

Main category: cs.CL

TL;DR: Study explores harvesting implicit user feedback from user-LM interaction logs to improve models without disruptive direct feedback, analyzing when feedback occurs and testing if incorporating feedback content (beyond polarity) improves performance.

DetailsMotivation: Language models deployed in real-world settings need to evolve based on user feedback, but asking for direct feedback disrupts user experience. Implicit feedback from interaction logs provides a non-disruptive alternative for model improvement.

Method: Analyzed two user-LM interaction datasets (WildChat and LMSYS) to understand when and why feedback occurs. Tested whether incorporating feedback content (e.g., user wanted clarification) along with polarity improves model performance on different benchmarks.

Result: Mixed results - incorporating feedback content helped performance on short human-designed questions (MTBench) but not on longer, more complex questions (WildBench).

Conclusion: Implicit user feedback from interaction logs shows potential for model improvement but has limitations, particularly with complex tasks. The approach works better for simpler, shorter interactions than complex ones.

Abstract: Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting implicit user feedback from user-LM interaction logs. We study two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation logs, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. Specifically, we study whether incorporating the contents of user feedback (e.g., user wanted clarification), in addition to the polarity of the feedback, can improve the model performance. We observe mixed results, showing this helps in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.

[180] XAutoLM: Efficient Fine-Tuning of Language Models via Meta-Learning and AutoML

Ernesto L. Estevanell-Valladares, Suilan Estevez-Velarde, Yoan Gutiérrez, Andrés Montoyo, Ruslan Mitkov

Main category: cs.CL

TL;DR: XAutoLM is a meta-learning AutoML framework that optimizes LM fine-tuning by reusing past experiences to reduce computational costs and improve performance.

DetailsMotivation: Current automated frameworks don't fully address model selection and HPO for efficient LM fine-tuning, which is computationally expensive and environmentally impactful.

Method: Uses meta-learning to extract task- and system-level meta-features from past successes/failures, biasing sampling toward valuable configurations and away from costly dead ends.

Result: Surpassed zero-shot optimizer’s peak F1 on 5/6 tasks, cut evaluation time by up to 4.5x, reduced search error ratios by 7x, and found 50% more pipelines above the zero-shot Pareto front.

Conclusion: XAutoLM enables resource-efficient Green AI fine-tuning, outperforming simpler memory-based approaches that suffer from negative transfer.

Abstract: Experts in machine learning leverage domain knowledge to navigate decisions in model selection, hyperparameter optimization, and resource allocation. This is particularly critical for fine-tuning language models (LMs), where repeated trials incur substantial computational overhead and environmental impact. However, no existing automated framework simultaneously tackles the entire model selection and hyperparameter optimization (HPO) task for resource-efficient LM fine-tuning. We introduce XAutoLM, a meta-learning-augmented AutoML framework that reuses past experiences to optimize discriminative and generative LM fine-tuning pipelines efficiently. XAutoLM learns from stored successes and failures by extracting task- and system-level meta-features to bias its sampling toward valuable configurations and away from costly dead ends. On four text classification and two question-answering benchmarks, XAutoLM surpasses zero-shot optimizer’s peak F1 on five of six tasks, cuts mean evaluation time of pipelines by up to 4.5x, reduces search error ratios by up to sevenfold, and uncovers up to 50% more pipelines above the zero-shot Pareto front. In contrast, simpler memory-based baselines suffer negative transfer. We release XAutoLM and our experience store to catalyze resource-efficient, Green AI fine-tuning in the NLP community.

[181] LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo

Mandira Sawkar, Samay U. Shetty, Deepak Pandita, Tharindu Cyril Weerasooriya, Christopher M. Homan

Main category: cs.CL

TL;DR: The paper presents improvements to the DisCo neural architecture for modeling annotator disagreement in the LeWiDi 2025 shared task, achieving better performance on soft label distribution prediction and perspectivist evaluation.

DetailsMotivation: To better model annotator disagreement through soft label distribution prediction and perspectivist evaluation, which focuses on individual annotators rather than aggregated labels.

Method: Extended DisCo architecture by introducing annotator metadata embeddings, enhanced input representations, and multi-objective training losses to better capture disagreement patterns.

Result: Substantial improvements in both soft and perspectivist evaluation metrics across three datasets, with better calibration and understanding of when disagreement-aware modeling works best.

Conclusion: Disagreement can be better captured by conditioning on annotator demographics and directly optimizing for distributional metrics, yielding consistent improvements across datasets.

Abstract: The Learning With Disagreements (LeWiDi) 2025 shared task aims to model annotator disagreement through soft label distribution prediction and perspectivist evaluation, which focuses on modeling individual annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend DisCo by introducing annotator metadata embeddings, enhancing input representations, and multi-objective training losses to capture disagreement patterns better. Through extensive experiments, we demonstrate substantial improvements in both soft and perspectivist evaluation metrics across three datasets. We also conduct in-depth calibration and error analyses that reveal when and why disagreement-aware modeling improves. Our findings show that disagreement can be better captured by conditioning on annotator demographics and by optimizing directly for distributional metrics, yielding consistent improvements across datasets.

[182] Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen

Main category: cs.CL

TL;DR: This paper introduces methods to address temporal oscillation in diffusion LLMs, where correct answers appear mid-process but get overwritten. It proposes Temporal Self-Consistency Voting and Temporal Consistency Reinforcement using Temporal Semantic Entropy to improve generation stability.

DetailsMotivation: Current diffusion LLM decoding strategies discard intermediate predictions, missing correct answers that emerge during denoising but get overwritten later due to temporal oscillation.

Method: Two methods: 1) Temporal Self-Consistency Voting - training-free decoding that aggregates predictions across denoising steps; 2) Temporal Consistency Reinforcement - post-training using Temporal Semantic Entropy as reward signal to encourage stable generations.

Result: Significant improvements: 24.7% average improvement on Countdown using negative TSE reward alone; combined with accuracy reward: 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown.

Conclusion: Temporal dynamics in dLLMs have untapped potential, and the proposed methods offer simple yet effective tools to harness intermediate predictions for improved generation quality and consistency.

Abstract: Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.

[183] Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

Ju-Chieh Chou, Jiawei Zhou, Karen Livescu

Main category: cs.CL

TL;DR: Proposes a textless spoken language model that jointly models linguistic and acoustic information by generating semantic tokens and continuous acoustic representations using flow-matching, improving acoustic detail in speech generation.

DetailsMotivation: Existing textless SLMs only predict semantic tokens and rely on separate vocoders, lacking acoustic context and control over acoustic details.

Method: Jointly model linguistic and acoustic information by generating semantic tokens and continuous acoustic representations using flow-matching objective, predicting multiple future semantic tokens to preserve linguistic information.

Result: Achieves comparable linguistic performance to existing models while providing better acoustic detail in prompted generation.

Conclusion: Joint modeling of linguistic and acoustic information with flow-matching improves acoustic detail in textless spoken language models while maintaining linguistic quality.

Abstract: Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.

[184] A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models

Jinyi Han, Xinyi Wang, Haiquan Zhao, Tingyun li, Zishang Jiang, Sihang Jiang, Jiaqing Liang, Xin Lin, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao

Main category: cs.CL

TL;DR: ProActive Self-Refinement (PASR) enables LLMs to dynamically refine outputs during generation, reducing token usage by 41.6% while improving accuracy by 8.2% compared to standard methods.

DetailsMotivation: Existing self-refinement methods use fixed iterations and reactive processes, lacking dynamic adaptation to evolving generation context and optimal refinement timing.

Method: PASR enables LLMs to proactively decide whether, when, and how to refine based on internal state and evolving context during generation, rather than regenerating entire responses.

Result: On Qwen3-8B, PASR reduces average token consumption by 41.6% and achieves 8.2% accuracy improvement across 10 diverse tasks compared to standard generation.

Conclusion: PASR demonstrates that proactive, context-aware refinement during generation significantly enhances LLM performance while reducing computational costs.

Abstract: Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model’s internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6% compared to standard generation, while also achieving an 8.2% improvement in accuracy. Our code and baselines used in the paper are available on GitHub.

[185] AutoBnB-RAG: Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation

Zefang Liu, Arman Anwar

Main category: cs.CL

TL;DR: AutoBnB-RAG extends the AutoBnB framework by incorporating retrieval-augmented generation (RAG) into multi-agent incident response simulations, improving decision quality and success rates in cybersecurity scenarios.

DetailsMotivation: Current LLM-based autonomous agents in incident response lack access to external knowledge, limiting their reasoning capabilities during cyber threat containment and mitigation.

Method: Extends AutoBnB framework with RAG in multi-agent simulations using Backdoors & Breaches game environment. Introduces two retrieval settings: RAG-Wiki (technical documentation) and RAG-News (narrative incident reports). Evaluates eight team structures including argumentative configurations for critical reasoning.

Result: Retrieval augmentation improves decision quality and success rates across diverse organizational models. The system demonstrates ability to reconstruct complex multi-stage attacks based on public breach reports.

Conclusion: Integrating retrieval mechanisms into LLM-based multi-agent systems provides significant value for cybersecurity decision-making by enhancing reasoning capabilities with external knowledge.

Abstract: Incident response (IR) requires fast, coordinated, and well-informed decision-making to contain and mitigate cyber threats. While large language models (LLMs) have shown promise as autonomous agents in simulated IR settings, their reasoning is often limited by a lack of access to external knowledge. In this work, we present AutoBnB-RAG, an extension of the AutoBnB framework that incorporates retrieval-augmented generation (RAG) into multi-agent incident response simulations. Built on the Backdoors & Breaches (B&B) tabletop game environment, AutoBnB-RAG enables agents to issue retrieval queries and incorporate external evidence during collaborative investigations. We introduce two retrieval settings: one grounded in curated technical documentation (RAG-Wiki), and another using narrative-style incident reports (RAG-News). We evaluate performance across eight team structures, including newly introduced argumentative configurations designed to promote critical reasoning. To validate practical utility, we also simulate real-world cyber incidents based on public breach reports, demonstrating AutoBnB-RAG’s ability to reconstruct complex multi-stage attacks. Our results show that retrieval augmentation improves decision quality and success rates across diverse organizational models. This work demonstrates the value of integrating retrieval mechanisms into LLM-based multi-agent systems for cybersecurity decision-making.

[186] OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha

Main category: cs.CL

TL;DR: OptimalThinkingBench is a unified benchmark that evaluates both overthinking and underthinking in LLMs, showing that current models fail to optimally balance performance and efficiency across different task complexities.

DetailsMotivation: Current LLMs either overthink on simple tasks (wasting compute) or underthink on complex reasoning problems (missing solutions), requiring users to manually select between thinking and non-thinking model variants.

Method: Created two sub-benchmarks: OverthinkingBench with 72 domains of simple queries, and UnderthinkingBench with 11 challenging reasoning tasks. Evaluated 33 thinking and non-thinking models using novel thinking-adjusted accuracy metrics.

Result: No model optimally balanced thinking across tasks. Thinking models wasted hundreds of tokens on simple queries without performance gains, while large non-thinking models underperformed compared to smaller thinking models on complex tasks.

Conclusion: Current approaches struggle to achieve optimal thinking - improving one sub-benchmark often comes at the expense of the other, highlighting the need for better unified models that can adapt thinking depth to task complexity.

Abstract: Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple math and general queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks along with harder math problems. Using novel thinking-adjusted accuracy metrics, we extensively evaluate 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.

[187] SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu

Main category: cs.CL

TL;DR: SurGE is a new benchmark for evaluating automated scientific survey generation in computer science, addressing the lack of standardized evaluation protocols.

DetailsMotivation: The rapid growth of academic literature makes manual survey creation infeasible, and current LLM-based approaches lack proper benchmarks for evaluation.

Method: SurGE includes test instances with topic descriptions, expert-written surveys, cited references, and a large academic corpus of over 1 million papers, plus an automated evaluation framework measuring four quality dimensions.

Result: Evaluation of diverse LLM-based methods shows significant performance gaps, with even advanced agentic frameworks struggling with survey generation complexities.

Conclusion: There is a clear need for future research in automated survey generation, and SurGE provides the necessary benchmark and evaluation framework to advance this field.

Abstract: The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based methods demonstrates a significant performance gap, revealing that even advanced agentic frameworks struggle with the complexities of survey generation and highlighting the need for future research in this area. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE

[188] OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova

Main category: cs.CL

TL;DR: OpenWHO is a new document-level parallel corpus for health domain machine translation, covering 20+ languages including 9 low-resource ones. LLMs outperform traditional MT models, with Gemini 2.5 Flash showing +4.79 ChrF improvement over NLLB-54B.

DetailsMotivation: There is a lack of MT evaluation datasets for low-resource languages in the high-stakes health domain, despite widespread deployment and domain-specific vocabulary needs.

Method: Created OpenWHO corpus with 2,978 documents and 26,824 sentences from WHO’s e-learning platform, then evaluated modern LLMs against traditional MT models on this resource.

Result: LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving +4.79 ChrF improvement over NLLB-54B on low-resource test set. Document-level translation benefits are most pronounced in specialized domains like health.

Conclusion: The OpenWHO corpus addresses the gap in health domain MT evaluation and demonstrates LLMs’ superiority over traditional approaches, especially for low-resource languages in specialized domains.

Abstract: In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization’s e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.

[189] ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

Main category: cs.CL

TL;DR: ObjexMT is a benchmark for testing LLM judges’ ability to extract latent objectives from multi-turn conversations and provide calibrated confidence scores, revealing significant challenges in objective inference and high-confidence errors.

DetailsMotivation: Current LLM-as-a-Judge systems lack decisive qualification tests for recovering latent conversation objectives and knowing when such inferences are trustworthy, especially given context degradation and multi-turn jailbreaks.

Method: ObjexMT benchmark requires models to extract one-sentence base objectives and self-reported confidence from multi-turn transcripts, with accuracy measured via LLM-judge semantic similarity to gold objectives and metacognition evaluated using ECE, Brier, Wrong at High-Confidence metrics.

Result: kimi-k2 achieved highest objective-extraction accuracy (0.612), claude-sonnet-4 showed best selective risk and calibration (AURC 0.242; ECE 0.206; Brier 0.254), with high-confidence errors ranging from 14.9% to 47.7% across models.

Conclusion: LLM judges often misinfer objectives when not explicit; recommendations include exposing objectives when feasible and gating decisions by confidence otherwise.

Abstract: LLM-as-a-Judge (LLMaaJ) now underpins scalable evaluation, yet we lack a decisive test of a judge’s qualification: can it recover a conversation’s latent objective and know when that inference is trustworthy? LLMs degrade under irrelevant or long context; multi-turn jailbreaks further hide goals across turns. We introduce ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must return a one-sentence base objective and self-reported confidence. Accuracy is computed via LLM-judge semantic similarity to gold objectives, converted to binary correctness by a human-aligned threshold calibrated on N=300 items (tau = 0.66; F1 = 0.891). Metacognition is evaluated with ECE, Brier, Wrong at High-Confidence (0.80/0.90/0.95), and risk-coverage. Across six models (gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2, deepseek-v3.1, gemini-2.5-flash) on three datasets, kimi-k2 attains the highest objective-extraction accuracy (0.612), with claude-sonnet-4 (0.603) and deepseek-v3.1 (0.599) statistically comparable. claude-sonnet-4 yields the best selective risk and calibration (AURC 0.242; ECE 0.206; Brier 0.254). Dataset heterogeneity (16-82 percent accuracy variance) reveals that automated obfuscation poses fundamental challenges beyond model choice. High-confidence errors persist: Wrong at 0.90 ranges from 14.9 percent (claude-sonnet-4) to 47.7 percent (Qwen3-235B-A22B-FP8). ObjexMT provides an actionable test for LLM judges: when objectives are not explicit, judges often misinfer them; we recommend exposing objectives when feasible and gating decisions by confidence otherwise. Data at https://github.com/hyunjun1121/ObjexMT_dataset.

[190] SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation

Xiaqiang Tang, Yi Wang, Keyu Hu, Rui Xu, Chuang Li, Weigao Sun, Jian Li, Sihong Xie

Main category: cs.CL

TL;DR: SSFO is a self-supervised alignment method that enhances RAG faithfulness by constructing preference pairs from model outputs with/without context and using DPO to align models without labeling costs or inference burden.

DetailsMotivation: Existing methods for improving RAG faithfulness require costly supervision, post-training, or significant inference burdens, creating a need for more efficient approaches.

Method: SSFO constructs preference data pairs by contrasting model outputs with and without context, then uses Direct Preference Optimization (DPO) with a modified loss function to encourage likelihood displacement from parametric-based tokens to context-aligned tokens.

Result: SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based QA datasets, with strong generalization including cross-lingual faithfulness and preserved instruction-following capabilities.

Conclusion: SSFO provides an effective self-supervised approach for enhancing RAG faithfulness without additional costs, leveraging likelihood displacement theory to achieve superior performance and generalization.

Abstract: Retrieval-Augmented Generation (RAG) systems require Large Language Models (LLMs) to generate responses that are faithful to the retrieved context. However, faithfulness hallucination remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. To overcome these limitations, we introduce Self-Supervised Faithfulness Optimization (SSFO), the first self-supervised alignment approach for enhancing RAG faithfulness. SSFO constructs preference data pairs by contrasting the model’s outputs generated with and without the context. Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness without incurring labeling costs or additional inference burden. We theoretically and empirically demonstrate that SSFO leverages a benign form of \emph{likelihood displacement}, transferring probability mass from parametric-based tokens to context-aligned tokens. Based on this insight, we propose a modified DPO loss function to encourage likelihood displacement. Comprehensive evaluations show that SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based question-answering datasets. Notably, SSFO exhibits strong generalization, improving cross-lingual faithfulness and preserving general instruction-following capabilities. We release our code and model at the anonymous link: https://github.com/chkwy/SSFO

[191] Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs

Duy Le, Kent Ziti, Evan Girard-Sun, Bakr Bouhaya, Sean O’Brien, Vasu Sharma, Kevin Zhu

Main category: cs.CL

TL;DR: AOF prompting framework improves multilingual riddle generation by filtering redundant outputs using cosine similarity, achieving better lexical diversity and cultural fluency without fine-tuning.

DetailsMotivation: Standard prompting methods for multilingual riddle generation tend to produce memorized or paraphrased content rather than original, culturally fluent riddles.

Method: Adaptive Originality Filtering (AOF) - a prompting framework that uses cosine-based similarity rejection to filter redundant generations while enforcing lexical novelty and cross-lingual fidelity.

Result: AOF-enhanced GPT-4o achieved 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, showing improved lexical diversity and reduced redundancy across three LLMs and four language pairs.

Conclusion: Semantic rejection through AOF can effectively guide culturally grounded, creative generation in multilingual contexts without requiring task-specific fine-tuning.

Abstract: Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies – zero-shot, few-shot, chain-of-thought – tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves \texttt{0.177} Self-BLEU and \texttt{0.915} Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.

[192] Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages

David Demitri Africa, Suchir Salhan, Yuval Weiss, Paula Buttery, Richard Diehl Martinez

Main category: cs.CL

TL;DR: Small decoder LMs pretrained with MAML show improved zero-shot NER performance in low-resource languages, with faster convergence and better handling of entity patterns.

DetailsMotivation: NER in low-resource languages typically requires large multilingual models, which are impractical for memory- or latency-constrained settings. The goal is to enable small decoder LMs to adapt quickly and transfer zero-shot to unseen languages.

Method: Replace part of autoregressive objective with first-order model-agnostic meta-learning (MAML) during pretraining. Test on Tagalog and Cebuano as challenging languages with typological similarity but structural differences in voice systems.

Result: MAML improves zero-shot micro-F1 by 2-6 pp with head-only tuning and 1-3 pp with full tuning across model sizes (11M-570M). Reduces convergence time by up to 8%. Best performance on single-token person entities co-occurring with Tagalog case particles.

Conclusion: MAML pretraining enables small LMs to better adapt to low-resource NER tasks, with particular strength in recognizing entities with surface anchors like case particles.

Abstract: Named-entity recognition (NER) in low-resource languages is usually tackled by finetuning very large multilingual LMs, an option that is often infeasible in memory- or latency-constrained settings. We ask whether small decoder LMs can be pretrained so that they adapt quickly and transfer zero-shot to languages unseen during pretraining. To this end we replace part of the autoregressive objective with first-order model-agnostic meta-learning (MAML). Tagalog and Cebuano are typologically similar yet structurally different in their actor/non-actor voice systems, and hence serve as a challenging test-bed. Across four model sizes (11 M - 570 M) MAML lifts zero-shot micro-F1 by 2-6 pp under head-only tuning and 1-3 pp after full tuning, while cutting convergence time by up to 8%. Gains are largest for single-token person entities that co-occur with Tagalog case particles si/ni, highlighting the importance of surface anchors.

[193] Post-training Large Language Models for Diverse High-Quality Responses

Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Yannis Paschalidis, Aldo Pacchiano

Main category: cs.CL

TL;DR: DQO is a novel training method that uses determinantal point processes to optimize LLMs for both quality and semantic diversity, addressing the diversity loss problem in RL-based post-training.

DetailsMotivation: RL-based post-training of LLMs often reduces output diversity, leading to narrow, canonical responses. Existing diversity enhancement methods are limited to inference-time operations or focus only on surface-level differences.

Method: Proposes DQO (Diversity Quality Optimization) based on determinantal point processes (DPPs). Samples and embeds multiple responses per prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by response embeddings. Can be applied on top of existing RL algorithms.

Result: Experiments across instruction-following, summarization, story generation, and reasoning tasks show that DQO substantially improves semantic diversity without sacrificing model quality.

Conclusion: DQO provides a flexible and effective approach to jointly optimize LLMs for both quality and semantic diversity, overcoming limitations of existing diversity enhancement methods.

Abstract: Reinforcement learning (RL) has emerged as a popular method for post-training large language models (LLMs). While improving the model’s performance on downstream tasks, it often reduces the model’s output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on surface-level differences. We propose a novel training method named DQO (Diversity Quality Optimization) based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. DQO is flexible and can be applied on top of existing RL algorithms. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.

[194] X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

Main category: cs.CL

TL;DR: X-Teaming Evolutionary M2S automates the discovery and optimization of multi-turn-to-single-turn (M2S) templates through language-model-guided evolution, achieving 44.8% success rate on GPT-4.1 and demonstrating transferable structural gains across models.

DetailsMotivation: To automate the process of discovering and optimizing M2S templates, moving beyond manually written templates used in prior work.

Method: Uses language-model-guided evolution with smart sampling from 12 sources and LLM-as-judge approach inspired by StrongREJECT, maintaining selection pressure with success threshold θ=0.70 across five evolutionary generations.

Result: Achieved 44.8% overall success (103/230) on GPT-4.1, discovered two new template families, and showed structural gains transfer across models but vary by target. Found positive correlation between prompt length and score.

Conclusion: Structure-level search is a reproducible method for stronger single-turn probes, highlighting the importance of threshold calibration and cross-model evaluation.

Abstract: Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $\theta = 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.

[195] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis

Main category: cs.CL

TL;DR: MASA is a framework that reduces transformer parameters by 66.7% through structured weight sharing across layers, using shared dictionary atoms for attention projection matrices, achieving comparable performance to full models.

DetailsMotivation: LLMs have high computational and memory demands, and existing compression techniques focus on intra-block optimizations while ignoring inter-block redundancy in transformers' repetitive layered structure.

Method: Decomposes attention projection matrices into shared dictionary atoms, representing each layer’s weights as linear combinations of these shared matrix atoms. Operates as a drop-in replacement trained with standard optimizers.

Result: Achieves better benchmark accuracy and perplexity than grouped-query attention, low-rank baselines, and other sharing methods at comparable parameter budgets. Works across scales (100M-700M parameters) and extends to Vision Transformers with similar performance gains.

Conclusion: MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance by combining dictionary learning with transformer efficiency, and can potentially reduce parameters in pretrained LLMs without significant performance drops.

Abstract: Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module’s parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers

  • and represents each layer’s weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.

[196] Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

Joachim Baumann, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del-Arco, Johannes B. Gruber, Dirk Hovy

Main category: cs.CL

TL;DR: LLM hacking is a phenomenon where configuration choices in large language models lead to incorrect conclusions in social science research, with both intentional manipulation and accidental errors causing significant issues.

DetailsMotivation: To investigate how implementation choices in LLMs (model selection, prompting strategies) introduce systematic biases and errors in social science research, potentially leading to false conclusions.

Method: Replicated 37 data annotation tasks from 21 published studies, tested 13 million labels from 18 different LLMs across 2361 hypotheses, and analyzed 21 mitigation techniques.

Result: Intentional LLM hacking is simple - with just a few prompt paraphrases, virtually anything can be made statistically significant. Accidental LLM hacking affects ~31% of hypotheses for state-of-the-art models and ~50% for smaller models. Human annotations provide crucial protection, and regression corrections can restore valid inference.

Conclusion: LLM hacking poses serious risks to research validity. Higher model capabilities reduce but don’t eliminate risk. Practical recommendations are needed to prevent manipulation and accidental errors, especially near significance thresholds.

Abstract: Large language models are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection or prompting strategy). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We call this phenomenon where configuration choices lead to incorrect conclusions LLM hacking. We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant. Beyond intentional manipulation, our analysis of 13 million labels from 18 different LLMs across 2361 realistic hypotheses shows that there is also a high risk of accidental LLM hacking, even when following standard research practices. We find incorrect conclusions in approximately 31% of hypotheses for state-of-the-art LLMs, and in half the hypotheses for smaller language models. While higher task performance and stronger general model capabilities reduce LLM hacking risk, even highly accurate models remain susceptible. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of LLM-based findings near significance thresholds. We analyze 21 mitigation techniques and find that human annotations provide crucial protection against false positives. Common regression estimator correction techniques can restore valid inference but trade off Type I vs. Type II errors. We publish a list of practical recommendations to prevent LLM hacking.

[197] Population-Aligned Persona Generation for LLM-based Social Simulation

Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie

Main category: cs.CL

TL;DR: A systematic framework for generating high-quality, population-aligned persona sets for LLM-driven social simulations that reduces bias and improves simulation accuracy.

DetailsMotivation: Existing LLM-based social simulations often overlook persona generation complexities and introduce biases through unrepresentative persona sets, limiting their authenticity for computational social science.

Method: Leverages LLMs to generate narrative personas from social media data, applies quality assessment filtering, uses importance sampling for global alignment with psychometric distributions, and includes task-specific adaptation for subpopulations.

Result: Extensive experiments show the method significantly reduces population-level bias and enables accurate, flexible social simulation for various research and policy applications.

Conclusion: The proposed framework provides a robust solution for synthesizing representative persona sets that authentically capture real-world population diversity, advancing LLM-based social simulation capabilities.

Abstract: Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.

[198] CEMTM: Contextual Embedding-based Multimodal Topic Modeling

Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

Main category: cs.CL

TL;DR: CEMTM is a context-enhanced multimodal topic model that infers coherent topics from documents with text and images using fine-tuned LVLMs and distributional attention, outperforming existing methods on multiple benchmarks.

DetailsMotivation: To address the challenge of inferring interpretable topic structures from multimodal documents (containing both text and images), especially handling both short and long documents and multiple images per document efficiently.

Method: Uses fine-tuned large vision language models for contextualized embeddings, employs distributional attention mechanism to weight token contributions, and uses reconstruction objective to align topic representations with document embeddings.

Result: Outperforms unimodal and multimodal baselines on six benchmarks, achieving average LLM score of 2.61, and demonstrates effectiveness in few-shot retrieval and capturing visually grounded semantics in scientific articles.

Conclusion: CEMTM provides an effective solution for multimodal topic modeling that maintains interpretability while handling multiple images efficiently and achieving superior performance across various domains.

Abstract: We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.

[199] Analyzing Information-Seeking Behaviors in a Hakka AI Chatbot: A Cognitive-Pragmatic Study

Chu-Hsuan Lee, Chen-Chi Chang, Hung-Shin Lee, Yun-Hsiang Hsu, Ching-Yuan Chen

Main category: cs.CL

TL;DR: Analysis of user interactions with TALKA, an AI chatbot for Hakka language learning, using Bloom’s Taxonomy and dialogue act categorization to understand cognitive processes and communication patterns in endangered language preservation.

DetailsMotivation: To address the risk of endangered languages disappearing by examining how technology and culturally informed teaching strategies can support language preservation through AI-powered chatbots.

Method: Dual-layered analytical framework combining Bloom’s Taxonomy (six cognitive levels) and dialogue act categorization (eleven types), applied to 7,077 annotated user utterances from TALKA chatbot interactions.

Result: Generative AI chatbots effectively support language learning by aligning dialogue acts with cognitive intentions, helping learners express themselves confidently and connect with cultural identity through various interaction types including information requests, translations, cultural inquiries, and creative language use.

Conclusion: AI-mediated dialogue facilitates cognitive development, pragmatic negotiation, and socio-cultural affiliation in low-resource language learners, offering empirical insights for technology-supported language preservation and educational practice.

Abstract: With many endangered languages at risk of disappearing, efforts to preserve them now rely more than ever on using technology alongside culturally informed teaching strategies. This study examines user behaviors in TALKA, a generative AI-powered chatbot designed for Hakka language engagement, by employing a dual-layered analytical framework grounded in Bloom’s Taxonomy of cognitive processes and dialogue act categorization. We analyzed 7,077 user utterances, each carefully annotated according to six cognitive levels and eleven dialogue act types. These included a variety of functions, such as asking for information, requesting translations, making cultural inquiries, and using language creatively. Pragmatic classifications further highlight how different types of dialogue acts–such as feedback, control commands, and social greetings–align with specific cognitive intentions. The results suggest that generative AI chatbots can support language learning in meaningful ways–especially when they are designed with an understanding of how users think and communicate. They may also help learners express themselves more confidently and connect with their cultural identity. The TALKA case provides empirical insights into how AI-mediated dialogue facilitates cognitive development in low-resource language learners, as well as pragmatic negotiation and socio-cultural affiliation. By focusing on AI-assisted language learning, this study offers new insights into how technology can support language preservation and educational practice.

[200] Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng

Main category: cs.CL

TL;DR: Align3 is a lightweight test-time deliberation method that uses hierarchical reflection and revision to help LLMs follow dynamic, scenario-specific behavioral and safety specifications, improving specification alignment with minimal overhead.

DetailsMotivation: LLMs are increasingly used in diverse real-world scenarios with bespoke behavioral and safety specifications that vary across scenarios and evolve over time, creating a need for effective specification alignment.

Method: Align3 employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over specification boundaries. The method is evaluated using SpecBench, a unified benchmark covering 5 scenarios, 103 specifications, and 1,500 prompts.

Result: Experiments on 15 reasoning and 18 instruct models show that: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps.

Conclusion: Test-time deliberation is an effective strategy for reasoning over real-world specification boundaries, with Align3 demonstrating improved specification alignment while maintaining a good safety-helpfulness balance.

Abstract: Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs’ ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.

[201] Rethinking the Role of Text Complexity in Language Model Pretraining

Dan John Velasco, Matthew Theodore Roque

Main category: cs.CL

TL;DR: The paper investigates how text complexity affects language model pretraining, finding that perplexity depends on model capacity-complexity interaction, while downstream performance varies: simpler texts help linguistic tasks, complex texts benefit world knowledge tasks.

DetailsMotivation: To understand the role of text complexity in language model pretraining and how it affects downstream performance across different model sizes and evaluation setups.

Method: Simplified human-written texts using LLM while preserving core content, pretrained causal models (28M-500M parameters) from scratch on original vs. simplified data, evaluated in fine-tuning and zero-shot settings.

Result: Smaller models degrade less on simpler texts; text complexity has minimal impact on fine-tuning but affects zero-shot performance: simpler texts help linguistic knowledge tasks, complex texts benefit world knowledge and entity tracking tasks.

Conclusion: Different types of data diversity affect transfer and zero-shot performance differently, suggesting data curation should be tailored to specific goals rather than using a one-size-fits-all approach.

Abstract: Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity–how hard a text is to read–remains less explored. We reduce surface-level complexity (shorter sentences, simpler words, simpler structure) while keeping core content approximately constant and ask: (i) How does complexity affect language modeling across model sizes? (ii) Can useful representations be learned from simpler text alone? (iii) How does pretraining text complexity influence downstream language understanding? We simplify human-written texts using a large language model, pretrain causal models (28M-500M) from scratch on original vs. simplified data, and evaluate them in fine-tuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity–smaller models degrade far less on simpler texts–while text complexity has little impact on fine-tuning evaluations, with zero-shot evaluations indicating that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking. Our findings suggest that different types of data diversity affect transfer and zero-shot performance differently, providing insight into tailoring data curation to specific goals.

[202] K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling

Yongrui Chen, Yi Huang, Yunchang Liu, Shenyu Zhang, Junhao He, Tongtong Wu, Guilin Qi, Tianxing Wu

Main category: cs.CL

TL;DR: K-DeCore is a novel framework for Continual Structured Knowledge Reasoning that uses knowledge decoupling and dual-perspective memory consolidation to handle sequential tasks with fixed parameters, outperforming existing continual learning methods.

DetailsMotivation: Existing continual learning approaches struggle with poor generalization to heterogeneous structured knowledge and inefficient reasoning due to parameter growth as tasks increase in CSKR.

Method: Proposes K-DeCore with knowledge decoupling mechanism that disentangles reasoning into task-specific and task-agnostic stages, dual-perspective memory consolidation, and structure-guided pseudo-data synthesis.

Result: Extensive experiments on four benchmark datasets show K-DeCore’s superiority over existing continual learning methods across multiple metrics using various backbone LLMs.

Conclusion: K-DeCore effectively addresses CSKR challenges by decoupling knowledge reasoning and maintaining fixed parameters, demonstrating strong performance across diverse structured knowledge tasks.

Abstract: Continual Structured Knowledge Reasoning (CSKR) focuses on training models to handle sequential tasks, where each task involves translating natural language questions into structured queries grounded in structured knowledge. Existing general continual learning approaches face significant challenges when applied to this task, including poor generalization to heterogeneous structured knowledge and inefficient reasoning due to parameter growth as tasks increase. To address these limitations, we propose a novel CSKR framework, \textsc{K-DeCore}, which operates with a fixed number of tunable parameters. Unlike prior methods, \textsc{K-DeCore} introduces a knowledge decoupling mechanism that disentangles the reasoning process into task-specific and task-agnostic stages, effectively bridging the gaps across diverse tasks. Building on this foundation, \textsc{K-DeCore} integrates a dual-perspective memory consolidation mechanism for distinct stages and introduces a structure-guided pseudo-data synthesis strategy to further enhance the model’s generalization capabilities. Extensive experiments on four benchmark datasets demonstrate the superiority of \textsc{K-DeCore} over existing continual learning methods across multiple metrics, leveraging various backbone large language models.

[203] Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage

Zipeng Ling, Yuehao Tang, Chen Huang, Shuliang Liu, Gaoyang Jiang, Shenghong Fu, Junqi Yang, Yao Wan, Jiawan Zhang, Kejia Huang, Xuming Hu

Main category: cs.CL

TL;DR: The paper investigates how different instruction formats affect LLM reasoning on datasets with sparse labels (flawed questions), introducing the concept of Instruction Boundary to analyze prompt coverage effects.

DetailsMotivation: Automatically generated datasets often contain inherent flaws like questions with none/multiple correct options or vague statements, creating sparse labels that challenge LLM reasoning.

Method: Proposed BiasDetector framework to quantify LLMs’ ability to identify sparse labels under different Instruction Boundary conditions (sufficient, redundant, insufficient prompt coverage), tested across 8 experimental settings and 5 dataset forms.

Result: Evaluations on 5 mainstream LLMs reveal substantial reasoning biases persist despite high accuracy, directly caused by prompt coverage issues in downstream tasks.

Conclusion: Findings emphasize the importance of addressing sparse labels and the need for developers to recognize and mitigate risks introduced by Instruction Boundary in LLM applications.

Abstract: Nowadays, automatically generated datasets are increasingly used in LLM reasoning tasks; however, large-scale corpora often contain inherent flaws. For example, a single-choice question may include none or multiple correct options, while true-or-false questions may involve vague or unverifiable statements. We refer to these exceptional answer forms as sparse labels. To compare LLMs’ ability to recognize various question forms and produce correct answers, we investigate how different instruction formats can either facilitate or mislead LLM reasoning ability. We introduce the concept of Instruction Boundary, which systematically analyzes how different levels of prompt coverage – sufficient, redundant, or insufficient – can lead to reasoning biases and performance changes in LLMs. To examine this phenomenon, we design eight experimental settings across five dataset forms. We further propose BiasDetector, a unified framework that quantifies LLMs’ ability to identify sparse labels under different kinds of Instruction Boundary conditions. Evaluations on five mainstream LLMs show that, despite their seemingly high accuracy, substantial reasoning biases persist in many downstream tasks as a direct consequence of prompt coverage. We analyze the impact of these biases and outline possible mitigation strategies. Our findings highlight not only the importance of addressing sparse labels, but also the need for developers to recognize and mitigate the risks introduced by Instruction Boundary.

[204] COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Dmitriy Shopkhoev, Denis Makhov, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis

Main category: cs.CL

TL;DR: CoSpaDi is a training-free compression framework that uses structured sparse dictionary learning instead of low-rank approximation for LLM compression, achieving better accuracy with union-of-subspaces representation and data-aware optimization.

DetailsMotivation: Low-rank weight approximation for LLM compression is computationally efficient but rigid, leading to significant accuracy drops. A more flexible approach is needed that can adapt to different weight structures without requiring retraining.

Method: Proposes structured sparse factorization using dense dictionary and column-sparse coefficient matrix, enabling union-of-subspaces representation. Uses calibration data to optimize factorization by minimizing functional reconstruction error rather than weight approximation.

Result: Outperforms state-of-the-art low-rank methods in accuracy and perplexity across Llama and Qwen models at 20-50% compression ratios. Preserves model fidelity without fine-tuning and is compatible with quantization for additional gains.

Conclusion: Structured sparse dictionary learning is a powerful alternative to conventional low-rank approaches for efficient LLM deployment, offering greater expressiveness and better accuracy preservation.

Abstract: Post-training compression of large language models (LLMs) largely relies on low-rank weight approximation, which represents each column of a weight matrix in a shared low-dimensional subspace. While this is a computationally efficient strategy, the imposed structural constraint is rigid and can lead to a noticeable model accuracy drop. In this work, we propose CoSpaDi (Compression via Sparse Dictionary Learning), a novel training-free compression framework that replaces low-rank decomposition with a more flexible structured sparse factorization in which each weight matrix is represented with a dense dictionary and a column-sparse coefficient matrix. This formulation enables a union-of-subspaces representation: different columns of the original weight matrix are approximated in distinct subspaces spanned by adaptively selected dictionary atoms, offering greater expressiveness than a single invariant basis. Crucially, CoSpaDi leverages a small calibration dataset to optimize the factorization such that the output activations of compressed projection layers closely match those of the original ones, thereby minimizing functional reconstruction error rather than mere weight approximation. This data-aware strategy preserves better model fidelity without any fine-tuning under reasonable compression ratios. Moreover, the resulting structured sparsity allows efficient sparse-dense matrix multiplication and is compatible with post-training quantization for further memory and latency gains. We evaluate CoSpaDi across multiple Llama and Qwen models under per-layer and per-group settings at 20-50% compression ratios, demonstrating consistent superiority over state-of-the-art data-aware low-rank methods both in accuracy and perplexity. Our results establish structured sparse dictionary learning as a powerful alternative to conventional low-rank approaches for efficient LLM deployment.

[205] ML2B: Multi-Lingual ML Benchmark For AutoML

Ekaterina Trofimova, Zosia Shamina, Maria Selifanova, Artem Zaitsev, Remi Savchuk, Maxim Minets, Daria Ozerova, Emil Sataev, Denis Zuenko, Andrey E. Ustyuzhanin

Main category: cs.CL

TL;DR: ML2B is the first benchmark for evaluating multilingual ML code generation, covering 30 Kaggle competitions translated into 13 languages, revealing 15-45% performance degradation on non-English tasks.

DetailsMotivation: Existing ML code generation benchmarks are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice.

Method: Created ML2B benchmark with 30 Kaggle competitions translated into 13 languages across tabular, text, and image data types. Used AIDE framework for automated end-to-end assessment of data science pipelines.

Result: Substantial 15-45% performance degradation on non-English tasks compared to English, highlighting challenges in multilingual representation learning for code generation.

Conclusion: The benchmark and evaluation framework are made available to facilitate future research in multilingual ML code generation, addressing critical gaps in current evaluation practices.

Abstract: Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.

[206] HEART: Emotionally-driven test-time scaling of Language Models

Gabriela Pinto, Palash Goyal, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Tomas Pfister, Hamid Palangi

Main category: cs.CL

TL;DR: HEART is a novel framework that uses emotionally-driven prompts for iterative self-correction in language models, improving reasoning performance through affective feedback based on six universal emotions.

DetailsMotivation: Current test-time scaling strategies focus on logical refinement but ignore the potential of affective feedback, despite psychological research showing emotions can modulate cognitive performance.

Method: HEART provides feedback on incorrect responses using emotionally charged phrases based on Ekman’s six universal emotions, systematically varying emotional tone across iterations to guide models away from flawed reasoning paths.

Result: When guided by an oracle verifier, HEART unlocks significantly deeper reasoning and achieves substantial accuracy improvements over state-of-the-art baselines on challenging benchmarks. However, it struggles in verifier-free settings.

Conclusion: The next frontier in machine reasoning may lie in understanding and leveraging the ‘HEART’ of models - combining emotional guidance with logical refinement, though practical deployment requires solving the verifier-free challenge.

Abstract: Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies such as self-reflection primarily focus on logical or structural refinement. They do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART–a novel framework that uses emotionally-driven prompts for iterative self-correction. HEART provides feedback on a model’s incorrect response using a curated set of concise, emotionally charged phrases based on the six universal emotions categorized by Dr. Paul Ekman. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity’s Last Exam, and SimpleQA. Our results reveal a significant new phenomenon: when guided by an oracle verifier, this affective iteration protocol unlocks significantly deeper reasoning, leading to consistent and substantial increases in accuracy over state-of-the-art baselines with the same verifier. However, we also identify a critical bottleneck for practical deployment. In a verifier-free setting, it struggles to harness these gains consistently, highlighting as a key challenge for future work. Our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the `HEART’ of the models.

[207] Non-Collaborative User Simulators for Tool Agents

Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo

Main category: cs.CL

TL;DR: Proposes a user simulator that simulates four types of non-collaborative behaviors to train and test tool agents against challenging real-world user interactions.

DetailsMotivation: Existing user simulators are too agent-friendly and cooperative, failing to prepare tool agents for non-collaborative users encountered in real-world scenarios.

Method: Novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing conversations, expressing impatience, and providing incomplete utterances.

Result: Experiments on MultiWOZ and τ-bench show significant performance degradation in state-of-the-art tool agents when facing non-collaborative users, with issues like escalated hallucinations and dialogue breakdowns.

Conclusion: Provides an extensible user simulation framework to help develop more robust tool agents and preemptively diagnose weaknesses under challenging real-world conditions.

Abstract: Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents’ weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.

[208] jina-reranker-v3: Last but Not Late Interaction for Listwise Document Reranking

Feng Wang, Yuqing Li, Han Xiao

Main category: cs.CL

TL;DR: jina-reranker-v3 is a 0.6B multilingual listwise reranker using ’last but not late’ interaction that achieves SOTA BEIR performance (61.94 nDCG@10) with significantly smaller size.

DetailsMotivation: To improve reranking performance while reducing model size, addressing limitations of late interaction models like ColBERT that encode documents separately before matching.

Method: Uses ’last but not late’ interaction with causal attention between query and all candidate documents in same context window, then extracts contextual embeddings from each document’s final token.

Result: Achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being significantly smaller than comparable models.

Conclusion: The novel ’last but not late’ interaction approach enables superior reranking performance with a much more compact model architecture.

Abstract: jina-reranker-v3 is a 0.6B-parameter multilingual listwise reranker that introduces a novel “last but not late” interaction. Unlike late interaction models like ColBERT that encode documents separately before multi-vector matching, our approach applies causal attention between the query and all candidate documents in the same context window, enabling rich interactions before extracting contextual embeddings from each document’s final token. The new model achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being significantly smaller than other models with comparable performance.

[209] Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts

Hanwen Du, Yuxin Dong, Xia Ning

Main category: cs.CL

TL;DR: The paper proposes Latent Thinking Optimization (LTO), a method that uses a latent classifier as a reward model to optimize latent thinking processes in LLMs, improving reasoning efficiency and reliability.

DetailsMotivation: Traditional verbal chain-of-thought reasoning in LLMs is computationally expensive and prone to overthinking. Latent thinking architectures like Huginn-3.5B address this but lack interpretability and reliable supervision.

Method: Developed a latent classifier to distinguish correct vs incorrect latent thinking patterns, then used it as a Latent Reward Model (LRM) in Latent Thinking Optimization (LTO) algorithm to probabilistically optimize latent reasoning processes.

Result: LTO significantly improved latent thinking processes across diverse reasoning tasks. The latent classifier reliably predicted answer correctness, and LRM effectively detected incorrect thinking patterns while generalizing across domains.

Conclusion: Reward modeling and supervised thinking optimization can be performed directly in latent space, offering a general, efficient, domain-agnostic approach to improve LLM thinking processes compared to verbal thinking methods.

Abstract: Large Language Models (LLMs) excel at problem solving by generating chain of thoughts in natural language, but such verbal thinking is computationally costly and prone to overthinking. Recent work instead proposes a latent thinking architecture Huginn-3.5B, which represents intermediate reasoning steps as sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of its latent thinking processes. In this paper, we provide a systematic study of how Huginn-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that LRM is highly effective in detecting incorrect latent thinking patterns, and LTO can significantly improve the latent thinking processes. Furthermore, we show that LRM can generalize across diverse domains, and LTO can be seamlessly applied to general LLMs to improve their thinking processes. In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.

[210] Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests

Yanbin Fu, Hong Jiao, Tianyi Zhou, Robert W. Lissitz, Nan Zhang, Ming Li, Qingshu Xu, Sydney Peters

Main category: cs.CL

TL;DR: Fine-tuned small language models (SLMs) outperform embedding-based supervised models for automated item alignment in standardized tests, with better performance when including more item text data.

DetailsMotivation: Human expert alignment of test items to content standards is subjective and time-consuming, requiring automated solutions to improve efficiency and objectivity.

Method: Fine-tuned different SLMs for domain and skill level alignment using data from college admissions reading/writing tests, compared with embedding-based supervised ML models, and conducted semantic similarity analysis.

Result: SLMs consistently outperformed embedding-based models, especially for fine-grained skill alignment. Including more item text data significantly improved performance beyond sample size increases alone.

Conclusion: Fine-tuned SLMs are effective for automated item alignment, with semantic similarity analysis revealing that misclassifications occur when skills are semantically too close.

Abstract: Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and time-consuming. This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for alignment at both domain and skill levels respectively with 10 skills mapped to 4 content domains. The model performance was evaluated in multiple criteria on two testing datasets. The impact of types and sizes of the input data for training was investigated. Results showed that including more item text data led to substantially better model performance, surpassing the improvements induced by sample size increase alone. For comparison, supervised machine learning models were trained using the embeddings from the multilingual-E5-large-instruct model. The study results showed that fine-tuned SLMs consistently outperformed the embedding-based supervised machine learning models, particularly for the more fine-grained skill alignment. To better understand model misclassifications, multiple semantic similarity analysis including pairwise cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimension projections of item embeddings were conducted. These analyses consistently showed that certain skills in SAT and PSAT were semantically too close, providing evidence for the observed misclassification.

[211] Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi

Main category: cs.CL

TL;DR: The paper identifies typicality bias in preference data as a key driver of mode collapse in LLM alignment, proposes Verbalized Sampling as a training-free prompting strategy to improve diversity, and demonstrates significant performance gains across various creative tasks.

DetailsMotivation: Post-training alignment often reduces LLM diversity (mode collapse), which previous work attributed to algorithmic limitations. This paper identifies a fundamental data-level driver: typicality bias in preference data where annotators systematically favor familiar text due to cognitive psychology findings.

Method: The authors formalize typicality bias theoretically, verify it empirically on preference datasets, and introduce Verbalized Sampling (VS) - a training-free prompting strategy that asks models to verbalize probability distributions over multiple responses (e.g., “Generate 5 jokes about coffee and their corresponding probabilities”).

Result: VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, increasing diversity by 1.6-2.1x in creative writing without sacrificing factual accuracy and safety. More capable models benefit more from VS.

Conclusion: The work provides a new data-centric perspective on mode collapse and a practical inference-time remedy (Verbalized Sampling) that helps unlock pre-trained generative diversity in LLMs.

Abstract: Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., “Generate 5 jokes about coffee and their corresponding probabilities”). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.

[212] Silent Tokens, Loud Effects: Padding in LLMs

Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson

Main category: cs.CL

TL;DR: Padding tokens in LLMs, though meant to be masked, can influence computation due to implementation errors, affecting activations, generation quality, bias, and safety across models like Llama, Gemma, and Qwen.

DetailsMotivation: To systematically study the unintended influence of padding tokens in LLMs, as their effects on computation are not well understood despite their widespread use for sequence length equalization.

Method: Controlled insertion of padding tokens across three open-source model families (Llama, Gemma, Qwen), evaluating outcomes along four axes: activations, generation quality, bias, and safety.

Result: Even small amounts of padding shift hidden representations, degrade quality in smaller models, alter bias unpredictably, and weaken safety guardrails.

Conclusion: Padding is not a harmless detail but a robustness risk that must be carefully handled in deployment.

Abstract: Padding tokens are widely used in large language models (LLMs) to equalize sequence lengths during batched inference. While they should be fully masked, implementation errors can cause them to influence computation, and the extent of this influence is not well understood. We systematically study this effect across three open-source model families (Llama, Gemma, Qwen), inserting controlled amounts of padding and evaluating outcomes along four axes: activations, generation quality, bias, and safety. Even small amounts of padding shift hidden representations, degrade quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. These findings demonstrate that padding is not a harmless detail but a robustness risk that must be carefully handled in deployment.

[213] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, Wei Han

Main category: cs.CL

TL;DR: AMAS is a dynamic multi-agent system framework that uses LLM-based graph design to automatically create optimal agent topologies for specific tasks, outperforming traditional fixed-architecture approaches.

DetailsMotivation: Current multi-agent systems using LLMs are limited by inflexible, hand-crafted graph topologies that lack contextual responsiveness, reducing their effectiveness across diverse workloads.

Method: Introduces AMAS with a dynamic graph designer that autonomously identifies task-specific optimal graph configurations through lightweight LLM adaptation, creating context-sensitive agent pathways.

Result: AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across question answering, mathematical deduction, and code generation benchmarks with diverse LLM architectures.

Conclusion: Context-sensitive structural adaptability is a foundational requirement for high-performance LLM multi-agent system deployments.

Abstract: Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.

[214] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

Seungseop Lim, Gibaeg Kim, Wooseok Han, Jean Seo, Hyunkyung Lee, Jaehyo Yoo, Eunho Yang

Main category: cs.CL

TL;DR: The paper addresses Format Inertia in LLMs for medical pre-consultation, where models generate repetitive but uninformative questions in long dialogues due to skewed turn-count distributions in training data. A simple data rebalancing method effectively mitigates this issue.

DetailsMotivation: LLMs adapted via SFT for medical pre-consultation suffer from Format Inertia—repetitive, format-correct but diagnostically uninformative questions—due to skewed turn-count distributions in training datasets.

Method: A data-centric approach that rebalances the turn-count distribution of the training dataset to address the skewed distribution causing Format Inertia.

Result: Experimental results demonstrate that the rebalancing method substantially alleviates Format Inertia in medical pre-consultation dialogues.

Conclusion: Rebalancing turn-count distributions in training data effectively mitigates Format Inertia, improving the quality of LLM-generated medical dialogues by reducing repetitive and uninformative questioning.

Abstract: Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term Format Inertia, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.

[215] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning

Qi He, Cheng Qian, Xiusi Chen, Bingxiang He, Yi R. Fung, Heng Ji

Main category: cs.CL

TL;DR: Veri-R1 is an online reinforcement learning framework that trains LLMs to interact with search engines for claim verification, improving accuracy by up to 30% and doubling evidence scores compared to traditional methods.

DetailsMotivation: Existing claim verification approaches rely on prompt engineering or pre-designed workflows without unified training, lacking the ability to improve necessary skills through interaction with retrieval systems.

Method: Online reinforcement learning framework where LLMs interact with search engines and receive reward signals to shape planning, retrieval, and reasoning behaviors, enabling dynamic interaction that reflects real-world verification scenarios.

Result: Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale model counterparts. Ablation studies reveal impact of reward components and link between output logits and label accuracy.

Conclusion: Online RL is effective for precise and faithful claim verification, providing important foundation for future research in LLM-empowered verification systems.

Abstract: Claim verification with large language models (LLMs) has recently attracted growing attention, due to their strong reasoning capabilities and transparent verification processes compared to traditional answer-only judgments. However, existing approaches to online claim verification, which requires iterative evidence retrieval and reasoning, still mainly rely on prompt engineering or pre-designed reasoning workflows, without unified training to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. This dynamic interaction of LLM with retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles the evidence score, often surpassing its larger-scale model counterparts. Ablation studies further reveal the impact of reward components, and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification, providing an important foundation for future research. We release our code to support community progress in LLM empowered claim verification.

[216] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang, Gongyi Zou, Xianghe Pang, Wenhao Wang, Menglan Chen, Shuo Tang, Zhiyu Li, Feiyu Xiong, Siheng Chen

Main category: cs.CL

TL;DR: InfoMosaic-Bench is the first benchmark for multi-source information seeking in tool-augmented agents, requiring agents to combine general-purpose web search with domain-specific tools across six domains.

DetailsMotivation: Existing LLM agents rely heavily on noisy and unreliable open-web search, while many real-world tasks require precise domain-specific knowledge. The emergence of Model Context Protocol (MCP) enables agents to access specialized tools, but it's unclear if agents can effectively leverage these tools and integrate them with general search.

Method: Created InfoMosaic-Bench covering six domains (medicine, finance, maps, video, web, and multi-domain integration) using InfoMosaic-Flow pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out trivial cases.

Result: Experiments with 14 state-of-the-art LLM agents show: web information alone is insufficient (38.2% accuracy, 67.5% pass rate); domain tools provide selective but inconsistent benefits; 22.4% of failures come from incorrect tool usage or selection.

Conclusion: Current LLMs still struggle with basic tool handling and effectively integrating multiple information sources, highlighting the need for improved tool-augmented agent capabilities.

Abstract: Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools – and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.

[217] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents

Kuntai Cai, Juncheng Liu, Xianglin Yang, Zhaojie Niu, Xiaokui Xiao, Xing Chen

Main category: cs.CL

TL;DR: The paper introduces Instance-Level Context Learning (ILCL) as a crucial third type of context for LLM agents, addressing the gap between environment-level manuals and task-level guidance by focusing on verifiable, reusable facts specific to environment instances.

DetailsMotivation: Current LLM agents often fail in complex tasks because they lack instance-level context - verifiable facts about specific environment instances like object locations, crafting recipes, and local rules that are essential for successful decision-making.

Method: Proposes a task-agnostic ILCL method using guided exploration with a compact TODO forest to prioritize actions and a lightweight plan-act-extract loop to execute them, automatically generating high-precision reusable context documents.

Result: Experiments across TextWorld, ALFWorld, and Crafter show significant improvements: ReAct’s success rate in TextWorld increased from 37% to 95%, and IGE improved from 81% to 95%, with gains in both success and efficiency.

Conclusion: By transforming one-off exploration into persistent, reusable knowledge, ILCL complements existing contexts to enable more reliable and efficient LLM agents, amortizing exploration costs across multiple downstream tasks.

Abstract: Large language model (LLM) agents typically receive two kinds of context: (i) environment-level manuals that define interaction interfaces and global rules, and (ii) task-level guidance or demonstrations tied to specific goals. In this work, we identify a crucial but overlooked third type of context, instance-level context, which consists of verifiable and reusable facts tied to a specific environment instance, such as object locations, crafting recipes, and local rules. We argue that the absence of instance-level context is a common source of failure for LLM agents in complex tasks, as success often depends not only on reasoning over global rules or task prompts but also on making decisions based on precise and persistent facts. Acquiring such context requires more than memorization: the challenge lies in efficiently exploring, validating, and formatting these facts under tight interaction budgets. We formalize this problem as Instance-Level Context Learning (ILCL) and introduce our task-agnostic method to solve it. Our method performs a guided exploration, using a compact TODO forest to intelligently prioritize its next actions and a lightweight plan-act-extract loop to execute them. This process automatically produces a high-precision context document that is reusable across many downstream tasks and agents, thereby amortizing the initial exploration cost. Experiments across TextWorld, ALFWorld, and Crafter demonstrate consistent gains in both success and efficiency: for instance, ReAct’s mean success rate in TextWorld rises from 37% to 95%, while IGE improves from 81% to 95%. By transforming one-off exploration into persistent, reusable knowledge, our method complements existing contexts to enable more reliable and efficient LLM agents.

[218] Pretraining with hierarchical memories: separating long-tail and common knowledge

Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

Main category: cs.CL

TL;DR: Small language models augmented with hierarchical parametric memory banks achieve comparable performance to much larger models by storing world knowledge in external memory rather than model parameters.

DetailsMotivation: Current language models require scaling parameters to store world knowledge, which is inefficient and impractical for edge devices with limited memory and compute. Only a fraction of knowledge is used per prompt.

Method: Memory-augmented architecture with hierarchical parametric memory banks. Small language models fetch context-dependent memory blocks during pretraining and inference. Pretraining learns to store long-tail knowledge in memory while the small model captures common knowledge and reasoning.

Result: A 160M-parameter model with 18M memory from a 4.6B memory bank performs comparably to models with 2x+ parameters. Hierarchical feed-forward memories work robustly across transformers, scaling to over 21B parameters.

Conclusion: Memory-augmented architectures enable efficient knowledge storage and access, making small models competitive with much larger ones while being practical for resource-constrained devices.

Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameters model augmented with an 18M-parameters memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.

[219] Semantic Similarity in Radiology Reports via LLMs and NER

Beth Pearson, Ahmed Adnan, Zahraa S. Abdallah

Main category: cs.CL

TL;DR: Proposes Llama-EntScore, a semantic similarity scoring method combining Llama 3.1 and NER with tunable weights to compare radiology reports, achieving 67% exact-match accuracy compared to radiologist scores.

DetailsMotivation: Identifying semantic differences between preliminary and final radiology reports is essential for junior radiologists' training and to uncover clinical knowledge gaps, but existing LLM approaches face challenges due to specialized domain knowledge requirements.

Method: Combines Llama 3.1 with Named-Entity-Recognition (NER) using tunable weights to emphasize/de-emphasize specific types of differences, generating both quantitative similarity scores and interpretable feedback.

Result: Achieves 67% exact-match accuracy and 93% accuracy within +/- 1 compared to radiologist-provided ground truth scores, outperforming both LLMs and NER used independently.

Conclusion: The proposed Llama-EntScore method effectively addresses limitations of standalone LLM and NER approaches for radiology report comparison, providing accurate semantic similarity assessment with interpretable feedback for training purposes.

Abstract: Radiology report evaluation is a crucial part of radiologists’ training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by comparing the performance of several LLMs in comparing radiology reports. We then assess a more traditional approach based on Named-Entity-Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress and also gives an interpretation of the score that aims to offer valuable guidance in reviewing and refining their reporting. We find our method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 when compared to radiologist-provided ground truth scores - outperforming both LLMs and NER used independently. Code is available at: https://github.com/otmive/llama_reports

[220] SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs?

Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu

Main category: cs.CL

TL;DR: SurveyBench is a new evaluation framework that uses quiz-driven metrics to assess the quality of automatically generated academic surveys, revealing significant gaps between LLM-generated surveys and human-written ones.

DetailsMotivation: Current automated survey generation methods (LLM4Survey) produce outputs that fall short of human standards, and there's a lack of rigorous benchmarks to properly evaluate their deficiencies from a reader's perspective.

Method: Created SurveyBench with: (1) survey topics from 11,343 arXiv papers and 4,947 high-quality surveys; (2) multifaceted metric hierarchy assessing outline quality, content quality, and non-textual richness; (3) dual-mode evaluation protocol with content-based and quiz-based answerability tests aligned with readers’ needs.

Result: SurveyBench effectively challenges existing LLM4Survey approaches, showing they perform on average 21% lower than human-written surveys in content-based evaluation.

Conclusion: The proposed evaluation framework successfully reveals the deficiencies in current automated survey generation methods and provides a more rigorous, reader-aligned benchmark for assessing survey quality.

Abstract: Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards and there lacks a rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill the gap, we propose a fine-grained, quiz-driven evaluation framework SurveyBench, featuring (1) typical survey topics source from recent 11,343 arXiv papers and corresponding 4,947 high-quality surveys; (2) a multifaceted metric hierarchy that assesses the outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers’ informational needs. Results show SurveyBench effectively challenges existing LLM4Survey approaches (e.g., on average 21% lower than human in content-based evaluation).

cs.CV

[221] SoC-DT: Standard-of-Care Aligned Digital Twins for Patient-Specific Tumor Dynamics

Moinak Bhattacharya, Gagandeep Singh, Prateek Prasanna

Main category: cs.CV

TL;DR: SoC-DT is a differentiable framework that combines reaction-diffusion tumor growth models with standard-of-care interventions and patient personalization to predict post-treatment tumor structure on imaging.

DetailsMotivation: Accurate prediction of tumor trajectories under standard-of-care therapies is essential for treatment planning but conventional models fail to capture tumor dynamics under heterogeneous therapeutic paradigms.

Method: SoC-DT unifies reaction-diffusion tumor growth models with discrete SoC interventions (surgery, chemotherapy, radiotherapy) and genomic/demographic personalization. Uses IMEX-SoC solver for stability and scalability.

Result: SoC-DT consistently outperforms classical PDE baselines and purely data-driven neural models in predicting tumor dynamics on both synthetic and real glioma data.

Conclusion: SoC-DT bridges mechanistic interpretability with modern differentiable solvers, establishing a principled foundation for patient-specific digital twins in oncology with biologically consistent tumor dynamics estimation.

Abstract: Accurate prediction of tumor trajectories under standard-of-care (SoC) therapies remains a major unmet need in oncology. This capability is essential for optimizing treatment planning and anticipating disease progression. Conventional reaction-diffusion models are limited in scope, as they fail to capture tumor dynamics under heterogeneous therapeutic paradigms. There is hence a critical need for computational frameworks that can realistically simulate SoC interventions while accounting for inter-patient variability in genomics, demographics, and treatment regimens. We introduce Standard-of-Care Digital Twin (SoC-DT), a differentiable framework that unifies reaction-diffusion tumor growth models, discrete SoC interventions (surgery, chemotherapy, radiotherapy) along with genomic and demographic personalization to predict post-treatment tumor structure on imaging. An implicit-explicit exponential time-differencing solver, IMEX-SoC, is also proposed, which ensures stability, positivity, and scalability in SoC treatment situations. Evaluated on both synthetic data and real world glioma data, SoC-DT consistently outperforms classical PDE baselines and purely data-driven neural models in predicting tumor dynamics. By bridging mechanistic interpretability with modern differentiable solvers, SoC-DT establishes a principled foundation for patient-specific digital twins in oncology, enabling biologically consistent tumor dynamics estimation. Code will be made available upon acceptance.

[222] Enhancing Fake News Video Detection via LLM-Driven Creative Process Simulation

Yuyan Bu, Qiang Sheng, Juan Cao, Shaofei Wang, Peng Qi, Yuhui Shi, Beizhe Hu

Main category: cs.CV

TL;DR: AgentAug is a data augmentation framework that generates diverse fake news videos using LLM-driven pipelines to simulate creative fabrication processes, improving fake news detection performance on short video platforms.

DetailsMotivation: Current fake news detectors suffer from biased patterns due to limited and undiversified training data, failing to capture the complex many-to-many relationships between video segments and fabricated news events in real-world scenarios.

Method: Proposes AgentAug with multiple LLM-driven pipelines simulating four fabrication categories for news video creation, combined with active learning based on uncertainty sampling to select useful augmented samples during training.

Result: Experimental results on two benchmark datasets show that AgentAug consistently improves the performance of short video fake news detectors.

Conclusion: AgentAug effectively addresses the data scarcity and diversity issues in fake news video detection by generating realistic augmented data that captures the complex relationships in real-world fake news creation.

Abstract: The emergence of fake news on short video platforms has become a new significant societal concern, necessitating automatic video-news-specific detection. Current detectors primarily rely on pattern-based features to separate fake news videos from real ones. However, limited and less diversified training data lead to biased patterns and hinder their performance. This weakness stems from the complex many-to-many relationships between video material segments and fabricated news events in real-world scenarios: a single video clip can be utilized in multiple ways to create different fake narratives, while a single fabricated event often combines multiple distinct video segments. However, existing datasets do not adequately reflect such relationships due to the difficulty of collecting and annotating large-scale real-world data, resulting in sparse coverage and non-comprehensive learning of the characteristics of potential fake news video creation. To address this issue, we propose a data augmentation framework, AgentAug, that generates diverse fake news videos by simulating typical creative processes. AgentAug implements multiple LLM-driven pipelines of four fabrication categories for news video creation, combined with an active learning strategy based on uncertainty sampling to select the potentially useful augmented samples during training. Experimental results on two benchmark datasets demonstrate that AgentAug consistently improves the performance of short video fake news detectors.

[223] Visualizing Celebrity Dynamics in Video Content: A Proposed Approach Using Face Recognition Timestamp Data

Doğanay Demir, İlknur Durgar Elkahlout

Main category: cs.CV

TL;DR: A hybrid framework combining distributed multi-GPU inference with interactive visualization for analyzing celebrity dynamics in video content, providing multi-dimensional insights through various visualizations.

DetailsMotivation: To understand video structure and dynamics in an era dominated by video content, particularly focusing on celebrity appearances and relationships in episodic content.

Method: Combines distributed multi-GPU inference system with optimized ONNX models, heterogeneous batch inference, and high-throughput parallelism for efficient video processing, followed by interactive visualization platform with multiple chart types.

Result: Successfully generates timestamped appearance records and provides comprehensive visualizations including frequency charts, duration analyses, co-appearance matrices, network graphs, and heatmaps for multi-dimensional insights.

Conclusion: The framework enables new possibilities for entertainment analytics, content creation strategies, and audience engagement studies by bridging distributed recognition with structured, visually-driven analytics.

Abstract: In an era dominated by video content, understanding its structure and dynamics has become increasingly important. This paper presents a hybrid framework that combines a distributed multi-GPU inference system with an interactive visualization platform for analyzing celebrity dynamics in video episodes. The inference framework efficiently processes large volumes of video data by leveraging optimized ONNX models, heterogeneous batch inference, and high-throughput parallelism, ensuring scalable generation of timestamped appearance records. These records are then transformed into a comprehensive suite of visualizations, including appearance frequency charts, duration analyses, pie charts, co-appearance matrices, network graphs, stacked area charts, seasonal comparisons, and heatmaps. Together, these visualizations provide multi-dimensional insights into video content, revealing patterns in celebrity prominence, screen-time distribution, temporal dynamics, co-appearance relationships, and intensity across episodes and seasons. The interactive nature of the system allows users to dynamically explore data, identify key moments, and uncover evolving relationships between individuals. By bridging distributed recognition with structured, visually-driven analytics, this work enables new possibilities for entertainment analytics, content creation strategies, and audience engagement studies.

[224] Domain-Robust Marine Plastic Detection Using Vision Models

Saanvi Kataria

Main category: cs.CV

TL;DR: This paper benchmarks cross-domain underwater plastic detection, finding lightweight CNNs like MobileNetV2 outperform larger models and zero-shot vision-language models in generalization.

DetailsMotivation: Marine plastic pollution requires reliable automation, but vision systems degrade due to domain shift when applied to new underwater imagery.

Method: Trained CNNs (MobileNetV2, ResNet-18, EfficientNet-B0) and vision transformers (DeiT-Tiny, ViT-B16) on labeled underwater data, evaluated on cross-domain test set. Also assessed zero-shot models CLIP ViT-L14 and Gemini 2.0 Flash.

Result: MobileNetV2 achieved strongest cross-domain performance (F1 0.97). All fine-tuned models had high Precision (~99%) but varied Recall. Zero-shot CLIP had Recall ~80% but Precision ~56%, while Gemini showed inverse profile (Precision ~99%, Recall ~81%).

Conclusion: Compact CNNs with supervised training generalize effectively for cross-domain underwater detection, while large pretrained vision-language models provide complementary strengths.

Abstract: Marine plastic pollution is a pressing environmental threat, making reliable automation for underwater debris detection essential. However, vision systems trained on one dataset often degrade on new imagery due to domain shift. This study benchmarks models for cross-domain robustness, training convolutional neural networks - CNNs (MobileNetV2, ResNet-18, EfficientNet-B0) and vision transformers (DeiT-Tiny, ViT-B16) on a labeled underwater dataset and then evaluates them on a balanced cross-domain test set built from plastic-positive images drawn from a different source and negatives from the training domain. Two zero-shot models were assessed, CLIP ViT-L14 and Google’s Gemini 2.0 Flash, that leverage pretraining to classify images without fine-tuning. Results show the lightweight MobileNetV2 delivers the strongest cross-domain performance (F1 0.97), surpassing larger models. All fine-tuned models achieved high Precision (around 99%), but differ in Recall, indicating varying sensitivity to plastic instances. Zero-shot CLIP is comparatively sensitive (Recall around 80%) yet prone to false positives (Precision around 56%), whereas Gemini exhibits the inverse profile (Precision around 99%, Recall around 81%). Error analysis highlights recurring confusions with coral textures, suspended particulates, and specular glare. Overall, compact CNNs with supervised training can generalize effectively for cross-domain underwater detection, while large pretrained vision-language models provide complementary strengths.

[225] SFANet: Spatial-Frequency Attention Network for Deepfake Detection

Vrushank Ahire, Aniruddh Muley, Shivam Zample, Siddharth Verma, Pranav Menon, Surbhi Madan, Abhinav Dhall

Main category: cs.CV

TL;DR: A novel ensemble framework combining transformers and texture-based methods achieves state-of-the-art deepfake detection performance on diverse datasets through innovative training and feature enhancement techniques.

DetailsMotivation: Existing deepfake detection methods fail to generalize across diverse datasets and generation techniques, creating a pressing need for more robust solutions.

Method: Ensemble framework combining Swin Transformers and ViTs with texture-based methods, using data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation techniques.

Result: Achieves state-of-the-art performance on DFWild-Cup dataset (diverse subset of eight deepfake datasets), demonstrating improved accuracy and robustness.

Conclusion: Hybrid models combining transformers and texture-based methods can effectively address deepfake detection challenges, offering robust real-world solutions through complementary feature extraction approaches.

Abstract: Detecting manipulated media has now become a pressing issue with the recent rise of deepfakes. Most existing approaches fail to generalize across diverse datasets and generation techniques. We thus propose a novel ensemble framework, combining the strengths of transformer-based architectures, such as Swin Transformers and ViTs, and texture-based methods, to achieve better detection accuracy and robustness. Our method introduces innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation techniques to handle dataset imbalances, enhance high-impact regions (e.g., eyes and mouth), and improve generalization. Our model achieves state-of-the-art performance when tested on the DFWild-Cup dataset, a diverse subset of eight deepfake datasets. The ensemble benefits from the complementarity of these approaches, with transformers excelling in global feature extraction and texturebased methods providing interpretability. This work demonstrates that hybrid models can effectively address the evolving challenges of deepfake detection, offering a robust solution for real-world applications.

[226] Multimodal Arabic Captioning with Interpretable Visual Concept Integration

Passant Elchafei, Amany Fashwan

Main category: cs.CV

TL;DR: VLCAP is an Arabic image captioning framework that combines CLIP-based visual label retrieval with multimodal text generation, using multilingual encoders and vision-language models to produce culturally coherent Arabic captions.

DetailsMotivation: To create an interpretable Arabic image captioning system that grounds generation in visual concepts rather than relying solely on end-to-end approaches, enabling culturally appropriate and contextually accurate captions.

Method: Two-stage pipeline: 1) Visual label retrieval using three multilingual encoders (mCLIP, AraCLIP, Jina V4) with a hybrid vocabulary from training captions and Visual Genome translations; 2) Caption generation using Qwen-VL and Gemini Pro Vision with retrieved labels as prompts.

Result: Best performance varied by metric: mCLIP + Gemini Pro Vision achieved highest BLEU-1 (5.34%) and cosine similarity (60.01%), while AraCLIP + Qwen-VL obtained best LLM-judge score (36.33%) across six encoder-decoder configurations.

Conclusion: The interpretable pipeline successfully enables culturally coherent and contextually accurate Arabic captions, with different encoder-decoder combinations excelling in different evaluation metrics.

Abstract: We present VLCAP, an Arabic image captioning framework that integrates CLIP-based visual label retrieval with multimodal text generation. Rather than relying solely on end-to-end captioning, VLCAP grounds generation in interpretable Arabic visual concepts extracted with three multilingual encoders, mCLIP, AraCLIP, and Jina V4, each evaluated separately for label retrieval. A hybrid vocabulary is built from training captions and enriched with about 21K general domain labels translated from the Visual Genome dataset, covering objects, attributes, and scenes. The top-k retrieved labels are transformed into fluent Arabic prompts and passed along with the original image to vision-language models. In the second stage, we tested Qwen-VL and Gemini Pro Vision for caption generation, resulting in six encoder-decoder configurations. The results show that mCLIP + Gemini Pro Vision achieved the best BLEU-1 (5.34%) and cosine similarity (60.01%), while AraCLIP + Qwen-VL obtained the highest LLM-judge score (36.33%). This interpretable pipeline enables culturally coherent and contextually accurate Arabic captions.

[227] ReactDiff: Fundamental Multiple Appropriate Facial Reaction Diffusion Model

Luo Cheng, Song Siyang, Yan Siyuan, Yu Zhen, Ge Zongyuan

Main category: cs.CV

TL;DR: ReactDiff is a temporal diffusion framework that generates diverse and realistic facial reactions in dialogues by incorporating spatio-temporal facial kinematics and action unit dependencies.

DetailsMotivation: Existing methods fail to model the stochasticity and dynamics of real human facial reactions, leading to unrealistic outputs with artifacts like jitters and unnatural expressions.

Method: Proposes ReactDiff framework that incorporates two priors: temporal facial behavioral kinematics and facial action unit dependencies to guide the diffusion process toward realistic human reaction manifolds.

Result: Extensive experiments on REACT2024 dataset show state-of-the-art reaction quality, diversity, and reaction appropriateness compared to existing methods.

Conclusion: ReactDiff successfully addresses the challenge of generating diverse and human-like facial reactions by incorporating anatomical and temporal constraints, achieving superior performance in both quality and diversity.

Abstract: The automatic generation of diverse and human-like facial reactions in dyadic dialogue remains a critical challenge for human-computer interaction systems. Existing methods fail to model the stochasticity and dynamics inherent in real human reactions. To address this, we propose ReactDiff, a novel temporal diffusion framework for generating diverse facial reactions that are appropriate for responding to any given dialogue context. Our key insight is that plausible human reactions demonstrate smoothness, and coherence over time, and conform to constraints imposed by human facial anatomy. To achieve this, ReactDiff incorporates two vital priors (spatio-temporal facial kinematics) into the diffusion process: i) temporal facial behavioral kinematics and ii) facial action unit dependencies. These two constraints guide the model toward realistic human reaction manifolds, avoiding visually unrealistic jitters, unstable transitions, unnatural expressions, and other artifacts. Extensive experiments on the REACT2024 dataset demonstrate that our approach not only achieves state-of-the-art reaction quality but also excels in diversity and reaction appropriateness.

[228] Convolutional Neural Nets vs Vision Transformers: A SpaceNet Case Study with Balanced vs Imbalanced Regimes

Akshar Gothi

Main category: cs.CV

TL;DR: Comparison of EfficientNet-B0 (CNN) and ViT-Base (Vision Transformer) on SpaceNet dataset under imbalanced and balanced label distributions, showing CNNs maintain efficiency advantages while both architectures perform well on balanced data.

DetailsMotivation: To conduct a controlled comparison between convolutional neural networks and vision transformers under different label distribution regimes to understand their relative performance and efficiency characteristics.

Method: Used SpaceNet dataset with two label-distribution regimes: naturally imbalanced five-class split and balanced-resampled split (700 images per class). Applied matched preprocessing (224x224, ImageNet normalization), lightweight augmentations, and 40-epoch training budget on single NVIDIA P100 GPU.

Result: On imbalanced split: EfficientNet-B0 achieved 93% test accuracy with strong macro-F1 and lower latency; ViT-Base was competitive at 93% but with larger parameter count and runtime. On balanced split: Both models performed strongly, with EfficientNet-B0 reaching 99% accuracy while ViT-Base remained competitive.

Conclusion: Balancing label distributions narrows architecture performance gaps, but CNNs retain efficiency advantages in terms of model size and latency. Both architectures can achieve strong performance when data is properly balanced.

Abstract: We present a controlled comparison of a convolutional neural network (EfficientNet-B0) and a Vision Transformer (ViT-Base) on SpaceNet under two label-distribution regimes: a naturally imbalanced five-class split and a balanced-resampled split with 700 images per class (70:20:10 train/val/test). With matched preprocessing (224x224, ImageNet normalization), lightweight augmentations, and a 40-epoch budget on a single NVIDIA P100, we report accuracy, macro-F1, balanced accuracy, per-class recall, and deployment metrics (model size and latency). On the imbalanced split, EfficientNet-B0 reaches 93% test accuracy with strong macro-F1 and lower latency; ViT-Base is competitive at 93% with a larger parameter count and runtime. On the balanced split, both models are strong; EfficientNet-B0 reaches 99% while ViT-Base remains competitive, indicating that balancing narrows architecture gaps while CNNs retain an efficiency edge. We release manifests, logs, and per-image predictions to support reproducibility.

[229] ExposureEngine: Oriented Logo Detection and Sponsor Visibility Analytics in Sports Broadcasts

Mehdi Houshmand Sarkhoosh, Frøy Øye, Henrik Nestor Sørlie, Nam Hoang Vu, Dag Johansen, Cise Midoglu, Tomas Kupka, Pål Halvorsen

Main category: cs.CV

TL;DR: ExposureEngine is an automated system that uses Oriented Bounding Boxes (OBB) for accurate sponsor logo detection in sports broadcasts, overcoming limitations of traditional horizontal bounding boxes that fail with rotated logos.

DetailsMotivation: Traditional sponsor visibility analysis in sports broadcasts is manual, subjective, and unscalable. Existing automated systems use axis-aligned bounding boxes that are inaccurate for rotated or skewed logos due to dynamic camera angles and perspective distortions.

Method: Developed an end-to-end system that predicts Oriented Bounding Boxes (OBB) for precise logo detection. Created a dataset of 1,103 frames from Swedish elite soccer with 670 unique sponsor logos annotated with OBBs. Integrated language-driven agentic layer for natural language queries and report generation.

Result: Achieved mean Average Precision (mAP@0.5) of 0.859, with precision of 0.96 and recall of 0.87. System provides precise visibility metrics including exposure duration and on-screen coverage. Complete analytics dashboard enables auditable and interpretable sponsor measurement.

Conclusion: ExposureEngine provides a comprehensive, automated solution for accurate sponsor visibility analytics in sports broadcasts using rotation-aware detection, overcoming limitations of traditional methods and enabling scalable, precise measurement.

Abstract: Quantifying sponsor visibility in sports broadcasts is a critical marketing task traditionally hindered by manual, subjective, and unscalable analysis methods. While automated systems offer an alternative, their reliance on axis-aligned Horizontal Bounding Box (HBB) leads to inaccurate exposuremetrics when logos appear rotated or skewed due to dynamic camera angles and perspective distortions. This paper introduces ExposureEngine, an end-to-end system designed for accurate, rotation-aware sponsor visibility analytics in sports broadcasts, demonstrated in a soccer case study. Our approach predicts Oriented Bounding Box (OBB) to provide a geometrically precise fit to each logo regardless of the orientation on-screen. To train and evaluate our detector, we developed a new dataset comprising 1,103 frames from Swedish elite soccer, featuring 670 unique sponsor logos annotated with OBBs. Our model achieves a mean Average Precision (mAP@0.5) of 0.859, with a precision of 0.96 and recall of 0.87, demonstrating robust performance in localizing logos under diverse broadcast conditions. The system integrates these detections into an analytical pipeline that calculates precise visibility metrics, such as exposure duration and on-screen coverage. Furthermore, we incorporate a language-driven agentic layer, enabling users to generate reports, summaries, and media content through natural language queries. The complete system, including the dataset and the analytics dashboard, provides a comprehensive solution for auditable and interpretable sponsor measurement in sports media. An overview of the ExposureEngine is available online: https://youtu.be/tRw6OBISuW4 .

[230] A Comprehensive Review on Artificial Intelligence Empowered Solutions for Enhancing Pedestrian and Cyclist Safety

Shucheng Zhang, Yan Shi, Bingzhang Wang, Yuang Zhang, Muhammad Monjurul Karim, Kehua Chen, Chenxi Liu, Mehrdad Nasri, Yinhai Wang

Main category: cs.CV

TL;DR: This paper provides a comprehensive review of camera-based AI sensing systems for vulnerable road user (VRU) safety, covering detection, tracking, trajectory prediction, and intent recognition.

DetailsMotivation: Existing surveys on AI for VRU safety focus mainly on detection, leaving gaps in other essential vision-based tasks needed for comprehensive VRU protection in dynamic urban environments.

Method: The authors systematically examine four core AI tasks: detection and classification, tracking and reidentification, trajectory prediction, and intent recognition and prediction, with emphasis on developments from the past five years.

Result: The review identifies emerging research trends and provides a foundational reference linking visual AI advances with practical considerations for real-world implementation in intelligent transportation systems.

Conclusion: The survey highlights four major open challenges from data, model, and deployment perspectives to guide future research in developing next-generation sensing systems for enhanced VRU safety.

Abstract: Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, remains a critical global challenge, as conventional infrastructure-based measures often prove inadequate in dynamic urban environments. Recent advances in artificial intelligence (AI), particularly in visual perception and reasoning, open new opportunities for proactive and context-aware VRU protection. However, existing surveys on AI applications for VRUs predominantly focus on detection, offering limited coverage of other vision-based tasks that are essential for comprehensive VRU understanding and protection. This paper presents a state-of-the-art review of recent progress in camera-based AI sensing systems for VRU safety, with an emphasis on developments from the past five years and emerging research trends. We systematically examine four core tasks, namely detection and classification, tracking and reidentification, trajectory prediction, and intent recognition and prediction, which together form the backbone of AI-empowered proactive solutions for VRU protection in intelligent transportation systems. To guide future research, we highlight four major open challenges from the perspectives of data, model, and deployment. By linking advances in visual AI with practical considerations for real-world implementation, this survey aims to provide a foundational reference for the development of next-generation sensing systems to enhance VRU safety.

[231] Paper2Video: Automatic Video Generation from Scientific Papers

Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou

Main category: cs.CV

TL;DR: PaperTalker is a multi-agent framework that automates academic presentation video generation from research papers, addressing challenges like multi-modal content coordination and dense information presentation.

DetailsMotivation: Academic presentation video production is highly labor-intensive, requiring hours of work for short videos. Current methods don't adequately handle the distinctive challenges of research paper inputs, dense multi-modal information, and coordination of multiple aligned channels.

Method: Proposes PaperTalker, a multi-agent framework that integrates slide generation with layout refinement using tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering. Uses parallel slide-wise generation for efficiency.

Result: Experiments on Paper2Video benchmark show that the generated presentation videos are more faithful and informative than existing baselines, establishing practical automated academic video generation.

Conclusion: PaperTalker represents a significant step toward automated and ready-to-use academic video generation, with the dataset, agent framework, and code made publicly available.

Abstract: Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker. To address these challenges, we introduce PaperTalker, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics–Meta Similarity, PresentArena, PresentQuiz, and IP Memory–to measure how videos convey the paper’s information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel effective tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.

[232] The View From Space: Navigating Instrumentation Differences with EOFMs

Ryan P. Demilt, Nicholas LaHaye, Karis Tenneson

Main category: cs.CV

TL;DR: Earth Observation Foundation Models (EOFMs) are sensitive to sensor architecture differences, which affects their representation spaces and highlights limitations in current model design.

DetailsMotivation: To understand how diverse sensor architectures impact the internal representations of Earth Observation Foundation Models, as most current models are trained on single modalities and applied across different sensors.

Method: Analysis of EOFM representation spaces to demonstrate sensitivity to sensor architecture differences.

Result: EOFMs’ representation spaces are highly sensitive to sensor architecture, revealing significant differences in how models process data from different sensors.

Conclusion: Understanding sensor architecture sensitivity provides crucial insights for improving EOFM design and guiding robust remote-sensing science practices.

Abstract: Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on the many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as ’embeddings’ which summarize high dimensional data to be used for generic tasks such as similarity search and content-specific queries. However, most EOFM models are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations of the present suite of EOFMs. We show in this work that the representation space of EOFMs is highly sensitive to sensor architecture and that understanding this difference gives a vital perspective on the pitfalls of current EOFM design and signals for how to move forward as model developers, users, and a community guided by robust remote-sensing science.

[233] Photorealistic Inpainting for Perturbation-based Explanations in Ecological Monitoring

Günel Aghakishiyeva, Jiayi Zhou, Saagar Arya, James David Poling, Holly R. Houliston, Jamie N. Womble, David W. Johnston, Brinnae Bent

Main category: cs.CV

TL;DR: An inpainting-guided perturbation method generates photorealistic explanations for ecological vision models, revealing fine-grained morphological cues in species recognition tasks.

DetailsMotivation: To address the opacity of automated ecological monitoring models and improve trust/field adoption by providing interpretable, photorealistic explanations.

Method: Uses inpainting-guided perturbation with SAM-refined masks for object removal/replacement and background replacement, applied to YOLOv9 seal detection in drone imagery.

Result: Produces explanations that localize diagnostic structures, avoid deletion artifacts, and provide domain-relevant insights validated by expert review and quantitative metrics.

Conclusion: The approach enables more trustworthy AI deployment in ecology by generating ecologically plausible explanations that support expert validation.

Abstract: Ecological monitoring is increasingly automated by vision models, yet opaque predictions limit trust and field adoption. We present an inpainting-guided, perturbation-based explanation technique that produces photorealistic, mask-localized edits that preserve scene context. Unlike masking or blurring, these edits stay in-distribution and reveal which fine-grained morphological cues drive predictions in tasks such as species recognition and trait attribution. We demonstrate the approach on a YOLOv9 detector fine-tuned for harbor seal detection in Glacier Bay drone imagery, using Segment-Anything-Model-refined masks to support two interventions: (i) object removal/replacement (e.g., replacing seals with plausible ice/water or boats) and (ii) background replacement with original animals composited onto new scenes. Explanations are assessed by re-scoring perturbed images (flip rate, confidence drop) and by expert review for ecological plausibility and interpretability. The resulting explanations localize diagnostic structures, avoid deletion artifacts common to traditional perturbations, and yield domain-relevant insights that support expert validation and more trustworthy deployment of AI in ecology.

[234] A Modular Conditional Diffusion Framework for Image Reconstruction

Magauiya Zhussip, Iaroslav Koshelev, Stamatis Lefkimmiatis

Main category: cs.CV

TL;DR: Proposes DP-IR, a modular diffusion framework for blind image restoration that combines pre-trained IR networks with DPMs, requiring only 0.7M additional parameters and achieving 4x sampling acceleration.

DetailsMotivation: Address the impracticality of existing DPM-based IR solutions due to task-specific nature, high computational costs, and training data requirements that hinder wider adoption.

Method: Modular framework combining pre-trained state-of-the-art IR networks with generative DPMs, requiring training of only a small task-specific module (0.7M parameters) and employing efficient sampling strategies.

Result: Outperforms existing approaches in perceptual quality, maintains competitive fidelity metrics, and achieves at least 4x reduction in neural function evaluations without performance loss.

Conclusion: DP-IR enables practical adoption of DPMs for image restoration by reducing computational requirements while maintaining high performance across multiple IR tasks.

Abstract: Diffusion Probabilistic Models (DPMs) have been recently utilized to deal with various blind image restoration (IR) tasks, where they have demonstrated outstanding performance in terms of perceptual quality. However, the task-specific nature of existing solutions and the excessive computational costs related to their training, make such models impractical and challenging to use for different IR tasks than those that were initially trained for. This hinders their wider adoption, especially by those who lack access to powerful computational resources and vast amount of training data. In this work we aim to address the above issues and enable the successful adoption of DPMs in practical IR-related applications. Towards this goal, we propose a modular diffusion probabilistic IR framework (DP-IR), which allows us to combine the performance benefits of existing pre-trained state-of-the-art IR networks and generative DPMs, while it requires only the additional training of a relatively small module (0.7M params) related to the particular IR task of interest. Moreover, the architecture of the proposed framework allows for a sampling strategy that leads to at least four times reduction of neural function evaluations without suffering any performance loss, while it can also be combined with existing acceleration techniques such as DDIM. We evaluate our model on four benchmarks for the tasks of burst JDD-SR, dynamic scene deblurring, and super-resolution. Our method outperforms existing approaches in terms of perceptual quality while it retains a competitive performance with respect to fidelity metrics.

[235] Advances in Medical Image Segmentation: A Comprehensive Survey with a Focus on Lumbar Spine Applications

Ahmed Kabil, Ghada Khoriba, Mina Yousef, Essam A. Rashed

Main category: cs.CV

TL;DR: This paper provides a comprehensive survey of medical image segmentation methods, covering both traditional techniques and modern deep learning approaches, with a special case study on lumbar spine segmentation.

DetailsMotivation: To bridge the gap between traditional image processing techniques and modern deep learning approaches in medical image segmentation, and to address persistent challenges in the field.

Method: Systematic survey methodology covering thresholding, edge detection, region-based segmentation, clustering algorithms, model-based techniques, CNNs, FCNs, U-Net variants, attention mechanisms, semi-supervised learning, GANs, and Transformer-based models.

Result: The survey comprehensively maps the evolution of MIS methodologies and identifies emerging trends including hybrid architectures, cross-modality learning, federated learning, and active learning strategies.

Conclusion: Despite significant progress, critical challenges persist including dataset bias, domain adaptation, model interpretability, and integration into clinical workflows, requiring continued research and development.

Abstract: Medical Image Segmentation (MIS) stands as a cornerstone in medical image analysis, playing a pivotal role in precise diagnostics, treatment planning, and monitoring of various medical conditions. This paper presents a comprehensive and systematic survey of MIS methodologies, bridging the gap between traditional image processing techniques and modern deep learning approaches. The survey encompasses thresholding, edge detection, region-based segmentation, clustering algorithms, and model-based techniques while also delving into state-of-the-art deep learning architectures such as Convolutional Neural Networks (CNNs), Fully Convolutional Networks (FCNs), and the widely adopted U-Net and its variants. Moreover, integrating attention mechanisms, semi-supervised learning, generative adversarial networks (GANs), and Transformer-based models is thoroughly explored. In addition to covering established methods, this survey highlights emerging trends, including hybrid architectures, cross-modality learning, federated and distributed learning frameworks, and active learning strategies, which aim to address challenges such as limited labeled datasets, computational complexity, and model generalizability across diverse imaging modalities. Furthermore, a specialized case study on lumbar spine segmentation is presented, offering insights into the challenges and advancements in this relatively underexplored anatomical region. Despite significant progress in the field, critical challenges persist, including dataset bias, domain adaptation, interpretability of deep learning models, and integration into real-world clinical workflows.

[236] Textured Gaussians for Enhanced 3D Scene Appearance Modeling

Brian Chao, Hung-Yu Tseng, Lorenzo Porzi, Chen Gao, Tuotuo Li, Qinbo Li, Ayush Saraf, Jia-Bin Huang, Johannes Kopf, Gordon Wetzstein, Changil Kim

Main category: cs.CV

TL;DR: The paper introduces texture mapping to 3D Gaussian Splatting (3DGS) to enhance the expressivity of individual Gaussians beyond simple ellipsoids with uniform colors.

DetailsMotivation: Standard 3DGS limits expressivity as each Gaussian can only represent a simple ellipsoid with uniform color. This restricts the ability to model fine geometric details and texture variations.

Method: The authors integrate texture mapping into 3DGS by augmenting each Gaussian with alpha, RGB, or RGBA texture maps to model spatially varying color and opacity across each Gaussian’s extent.

Result: The method achieves higher image quality than existing approaches while using similar or fewer Gaussians, with alpha-only texture maps significantly improving expressivity and RGB texture maps providing the highest expressivity.

Conclusion: Texture mapping greatly enhances 3DGS expressivity, enabling richer texture patterns and geometric structures while maintaining the efficiency advantages of Gaussian splatting.

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a state-of-the-art 3D reconstruction and rendering technique due to its high-quality results and fast training and rendering time. However, pixels covered by the same Gaussian are always shaded in the same color up to a Gaussian falloff scaling factor. Furthermore, the finest geometric detail any individual Gaussian can represent is a simple ellipsoid. These properties of 3DGS greatly limit the expressivity of individual Gaussian primitives. To address these issues, we draw inspiration from texture and alpha mapping in traditional graphics and integrate it with 3DGS. Specifically, we propose a new generalized Gaussian appearance representation that augments each Gaussian with alpha~(A), RGB, or RGBA texture maps to model spatially varying color and opacity across the extent of each Gaussian. As such, each Gaussian can represent a richer set of texture patterns and geometric structures, instead of just a single color and ellipsoid as in naive Gaussian Splatting. Surprisingly, we found that the expressivity of Gaussians can be greatly improved by using alpha-only texture maps, and further augmenting Gaussians with RGB texture maps achieves the highest expressivity. We validate our method on a wide variety of standard benchmark datasets and our own custom captures at both the object and scene levels. We demonstrate image quality improvements over existing methods while using a similar or lower number of Gaussians.

[237] DECOR: Deep Embedding Clustering with Orientation Robustness

Fiona Victoria Stanley Jothiraj, Arunaggiri Pandian Karunanidhi, Seth A. Eichmeyer

Main category: cs.CV

TL;DR: DECOR is a deep clustering framework for wafer defect detection that handles orientation variations and imperfect data conditions, outperforming existing methods on the MixedWM38 dataset.

DetailsMotivation: Early detection of wafer defects is critical for semiconductor manufacturing yield optimization, but raw wafer data is complex, unlabeled, imbalanced, and can contain multiple defects per wafer, requiring robust clustering methods.

Method: DECOR (Deep Clustering with Orientation Robustness) framework that groups complex defect patterns from wafer maps while explicitly accounting for orientation variations to ensure spatially similar defects are consistently clustered regardless of rotation or alignment.

Result: The method outperforms existing clustering baseline methods on the MixedWM38 dataset and demonstrates ability to discover clusters without manual tuning.

Conclusion: DECOR provides a reliable and scalable solution for automated visual inspection systems in semiconductor manufacturing by handling imperfect data conditions and orientation variations.

Abstract: In semiconductor manufacturing, early detection of wafer defects is critical for product yield optimization. However, raw wafer data from wafer quality tests are often complex, unlabeled, imbalanced and can contain multiple defects on a single wafer, making it crucial to design clustering methods that remain reliable under such imperfect data conditions. We introduce DECOR, a deep clustering with orientation robustness framework that groups complex defect patterns from wafer maps into consistent clusters. We evaluate our method on the open source MixedWM38 dataset, demonstrating its ability to discover clusters without manual tuning. DECOR explicitly accounts for orientation variations in wafer maps, ensuring that spatially similar defects are consistently clustered regardless of its rotation or alignment. Experiments indicate that our method outperforms existing clustering baseline methods, thus providing a reliable and scalable solution in automated visual inspection systems.

[238] Error correction in multiclass image classification of facial emotion on unbalanced samples

Andrey A. Lebedev, Victor B. Kazantsev, Sergey V. Stasenko

Main category: cs.CV

TL;DR: This paper proposes an error correction method for multi-class facial emotion classification using LSTM with attention mechanism to handle class imbalance, showing improved performance for minority classes.

DetailsMotivation: To address the problem of class imbalance in facial emotion recognition where some emotions are significantly more prevalent than others, which affects classification accuracy for rare emotional states.

Method: Uses an LSTM neural network with attention mechanism focusing on key facial areas. Trained on subsets of 6 emotion classes and performs error correction for the 7th excluded class. Experiments conducted on all possible class subset configurations.

Result: Error correction is possible for all classes with varying success rates - some classes are better restored than others. Test results show improvement in key quality metrics for small classes, indicating effectiveness for rare event detection.

Conclusion: The proposed method is effective for facial expression analysis systems and can be applied to tasks requiring stable classification under imbalanced class distributions, such as anti-fraud systems for detecting rare events.

Abstract: This paper considers the problem of error correction in multi-class classification of face images on unbalanced samples. The study is based on the analysis of a data frame containing images labeled by seven different emotional states of people of different ages. Particular attention is paid to the problem of class imbalance, in which some emotions significantly prevail over others. To solve the classification problem, a neural network model based on LSTM with an attention mechanism focusing on key areas of the face that are informative for emotion recognition is used. As part of the experiments, the model is trained on all possible configurations of subsets of six classes with subsequent error correction for the seventh class, excluded at the training stage. The results show that correction is possible for all classes, although the degree of success varies: some classes are better restored, others are worse. In addition, on the test sample, when correcting some classes, an increase in key quality metrics for small classes was recorded, which indicates the promise of the proposed approach in solving applied problems related to the search for rare events, for example, in anti-fraud systems. Thus, the proposed method can be effectively applied in facial expression analysis systems and in tasks requiring stable classification under skewed class distribution.

[239] QGFace: Quality-Guided Joint Training For Mixed-Quality Face Recognition

Youzhe Song, Feng Wang

Main category: cs.CV

TL;DR: A quality-guided joint training approach for mixed-quality face recognition that uses different learning methods for high-quality and low-quality images with a single encoder.

DetailsMotivation: Existing face recognition methods perform poorly on mixed-quality images as they are designed specifically for either high-quality or low-quality images, requiring pre-trained feature extractors or auxiliary structures.

Method: Proposes a joint training approach with quality partition: classification-based learning for HQ images and self-supervised image-image contrastive learning for LQ images, using a proxy-updated real-time queue for effective contrastive learning.

Result: Experiments on SCface, Tinyface, IJB-B, and five high-quality datasets demonstrate effectiveness in recognizing face images of different qualities.

Conclusion: The proposed quality-guided joint training approach successfully handles mixed-quality face recognition by applying appropriate learning strategies based on image quality, achieving improved performance across various quality datasets.

Abstract: The quality of a face crop in an image is decided by many factors such as camera resolution, distance, and illumination condition. This makes the discrimination of face images with different qualities a challenging problem in realistic applications. However, most existing approaches are designed specifically for high-quality (HQ) or low-quality (LQ) images, and the performances would degrade for the mixed-quality images. Besides, many methods ask for pre-trained feature extractors or other auxiliary structures to support the training and the evaluation. In this paper, we point out that the key to better understand both the HQ and the LQ images simultaneously is to apply different learning methods according to their qualities. We propose a novel quality-guided joint training approach for mixed-quality face recognition, which could simultaneously learn the images of different qualities with a single encoder. Based on quality partition, classification-based method is employed for HQ data learning. Meanwhile, for the LQ images which lack identity information, we learn them with self-supervised image-image contrastive learning. To effectively catch up the model update and improve the discriminability of contrastive learning in our joint training scenario, we further propose a proxy-updated real-time queue to compose the contrastive pairs with features from the genuine encoder. Experiments on the low-quality datasets SCface and Tinyface, the mixed-quality dataset IJB-B, and five high-quality datasets demonstrate the effectiveness of our proposed approach in recognizing face images of different qualities.

[240] OpusAnimation: Code-Based Dynamic Chart Generation

Bozheng Li, Miao Yang, Zhenhan Chen, Jiawang Cao, Mushui Liu, Yi Lu, Yongliang Wu, Bin Zhang, Yangguang Ji, Licheng Tang, Jay Wu, Wenbo Zhu

Main category: cs.CV

TL;DR: DCG-Bench is the first benchmark for dynamic chart generation, evaluating MLLMs on text-to-chart and video-to-chart tasks. The authors created DCG-8K dataset and developed a two-stage training method with joint reward optimization, achieving state-of-the-art performance with their 3B parameter model.

DetailsMotivation: Dynamic Chart Generation (DCG) using code-rendered animated visualizations is underexplored despite advances in multi-modal LLMs for static charts. There's a research gap in evaluating and improving MLLMs' capabilities for dynamic chart generation and understanding.

Method: Introduced DCG-Bench benchmark with three task types: Simple Text-to-Chart, Detailed Text-to-Chart, and Video-to-Chart. Built DCG-8K dataset with instruction-code-video triplets and QA pairs. Developed two-stage training recipe using Joint-Code-Visual Reward for group relative policy optimization to create expert MLLM Qwen2.5-VL-DCG-3B.

Result: Benchmarking revealed shortcomings of existing MLLMs in visual-to-chart tasks. The proposed model achieved 8.31% average performance gain over best open-sourced MLLM across three tasks, and showed comparable performance to proprietary models despite having only 3B parameters.

Conclusion: The DCG-Bench benchmark and proposed training methodology effectively address the dynamic chart generation challenge. The results demonstrate the effectiveness of the two-stage training recipe and joint reward optimization for creating capable DCG models with relatively small parameter counts.

Abstract: Dynamic Chart Generation (DCG) involves producing code-rendered animated visualizations as charts. While recent advances in multi-modal large language models (MLLMs) have significantly improved their capability on static chart generation and comprehension, MLLMs’ potential for handling dynamic chart generation and understanding remains underexplored. To bridge this research gap, we introduce DCG-Bench (Dynamic Chart Generation Benchmark), the first benchmark evaluating MLLM’s capability on dynamic chart generation tasks from three dimensions: Simple Text-to-Chart, Detailed Text-to-Chart, and Video-to-Chart tasks. We construct DCG-8K, a high-quality DCG dataset with annotations covering instruction-code-video triplets and QA pairs for both code and video evaluation. Based on DCG-8K, we explored a two-stage training recipe, proposing Joint-Code-Visual Reward for group relative policy optimization to construct expert MLLM Qwen2.5-VL-DCG-3B for the DCG task. Our benchmarking result reveals shortcomings of existing MLLMs in the visual-to-chart task, and our model beats the best open-sourced MLLM with an average 8.31% performance gain across three tasks, and shows on par performance against proprietary models with only 3B parameters, proving the effectiveness of our training recipe. Our code and dataset will be publicly available.

[241] STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang

Main category: cs.CV

TL;DR: STIV is a simple and scalable text-image-conditioned video generation method that integrates image conditions through frame replacement and text conditions via joint classifier-free guidance, achieving state-of-the-art performance on T2V and I2V tasks.

DetailsMotivation: There is a need for a clear, systematic recipe to guide the development of robust and scalable video generation models, as the field has advanced but lacks transparent frameworks.

Method: STIV integrates image condition into a Diffusion Transformer (DiT) through frame replacement and incorporates text conditioning via joint image-text conditional classifier-free guidance, enabling simultaneous T2V and TI2V tasks.

Result: An 8.7B model achieves 83.1 on VBench T2V (surpassing CogVideoX-5B, Pika, Kling, Gen-3) and state-of-the-art 90.1 on VBench I2V at 512 resolution. The framework extends to video prediction, frame interpolation, multi-view generation, and long video generation.

Conclusion: STIV provides a transparent and extensible recipe for building cutting-edge video generation models, empowering future research and accelerating progress toward more versatile and reliable video generation solutions.

Abstract: The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation, etc. With comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrate strong performance, despite its simple design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.

[242] Visual Odometry with Transformers

Vlardimir Yugay, Duy-Kien Nguyen, Theo Gevers, Cees G. M. Snoek, Martin R. Oswald

Main category: cs.CV

TL;DR: VoT is an end-to-end Visual odometry Transformer that directly predicts camera motion from monocular video sequences using temporal and spatial attention, eliminating traditional pipeline components like bundle adjustment and feature matching.

DetailsMotivation: Existing monocular visual odometry methods rely on complex pipelines with handcrafted components that struggle with generalization to unseen scenarios and require camera calibration and hyperparameter tuning.

Method: VoT processes monocular frame sequences using feature extraction and global relationship modeling through temporal and spatial attention. It directly predicts camera motion without dense geometry estimation, using only camera poses for supervision.

Result: VoT outperforms traditional methods, runs 3x faster, scales with larger datasets, benefits from stronger pre-trained backbones, and generalizes across diverse camera motions and calibration settings.

Conclusion: Monocular visual odometry can be effectively addressed end-to-end without traditional pipeline components, achieving better performance and generalization while being significantly faster.

Abstract: Modern monocular visual odometry methods typically combine pre-trained deep learning components with optimization modules, resulting in complex pipelines that rely heavily on camera calibration and hyperparameter tuning, and often struggle in unseen real-world scenarios. Recent large-scale 3D models trained on massive amounts of multi-modal data have partially alleviated these challenges, providing generalizable dense reconstruction and camera pose estimation. Still, they remain limited in handling long videos and providing accurate per-frame estimates, which are required for visual odometry. In this work, we demonstrate that monocular visual odometry can be addressed effectively in an end-to-end manner, thereby eliminating the need for handcrafted components such as bundle adjustment, feature matching, camera calibration, or dense 3D reconstruction. We introduce VoT, short for Visual odometry Transformer, which processes sequences of monocular frames by extracting features and modeling global relationships through temporal and spatial attention. Unlike prior methods, VoT directly predicts camera motion without estimating dense geometry and relies solely on camera poses for supervision. The framework is modular and flexible, allowing seamless integration of various pre-trained encoders as feature extractors. Experimental results demonstrate that VoT scales effectively with larger datasets, benefits substantially from stronger pre-trained backbones, generalizes across diverse camera motions and calibration settings, and outperforms traditional methods while running more than 3 times faster. The code will be released.

[243] Inference-Time Search using Side Information for Diffusion-based Image Reconstruction

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

Main category: cs.CV

TL;DR: A novel inference-time search algorithm that uses side information to guide diffusion models for solving inverse problems, improving reconstruction quality without gradient-based guidance artifacts.

DetailsMotivation: Existing diffusion model approaches for inverse problems overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings.

Method: Proposed an inference-time search algorithm that guides sampling using side information while balancing exploration and exploitation, avoiding gradient-based guidance issues.

Result: Consistently improves qualitative and quantitative performance across various inverse problems including box inpainting, super-resolution, and multiple deblurring tasks, outperforming reward gradient-based guidance baselines.

Conclusion: The approach enables more accurate and reliable reconstructions, can be seamlessly integrated into existing diffusion-based pipelines, and provides an effective alternative to gradient-based guidance methods.

Abstract: Diffusion models have emerged as powerful priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel inference-time search algorithm that guides the sampling process using the side information in a manner that balances exploration and exploitation. This enables more accurate and reliable reconstructions, providing an alternative to the gradient-based guidance that is prone to reward-hacking artifacts. Our approach can be seamlessly integrated into a wide range of existing diffusion-based image reconstruction pipelines. Through extensive experiments on a number of inverse problems, such as box inpainting, super-resolution, and various deblurring tasks including motion, Gaussian, nonlinear, and blind deblurring, we show that our approach consistently improves the qualitative and quantitative performance of diffusion-based image reconstruction algorithms. We also show the superior performance of our approach with respect to other baselines, including reward gradient-based guidance algorithms. The code is available at \href{https://github.com/mhdfb/sideinfo-search-reconstruction}{this repository}.

[244] Unified Unsupervised Anomaly Detection via Matching Cost Filtering

Zhe Zhang, Mingxiu Cai, Gaochang Wu, Jing Zhang, Lingqiao Liu, Dacheng Tao, Tianyou Chai, Xiatian Zhu

Main category: cs.CV

TL;DR: Unified Cost Filtering (UCF) is a post-hoc refinement framework that enhances unsupervised anomaly detection by filtering matching noise in cost volumes, achieving state-of-the-art results across unimodal and multimodal settings.

DetailsMotivation: Existing unsupervised anomaly detection methods suffer from matching noise and lack unified approaches for both unimodal and multimodal scenarios, limiting detection performance and knowledge transfer.

Method: UCF constructs cost volumes by matching test samples against normal samples, then applies a learnable filtering module with multi-layer attention guidance to reduce matching noise and highlight subtle anomalies.

Result: Comprehensive experiments on 22 benchmarks show UCF consistently improves various UAD methods, achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) scenarios.

Conclusion: UCF provides a generic and effective framework for refining anomaly detection across diverse settings by addressing the fundamental challenge of matching noise in cost volumes.

Abstract: Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB–3D and RGB–Text, enabled by point cloud sensing and vision–language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB–3D, RGB–Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.

[245] Sonar Image Datasets: A Comprehensive Survey of Resources, Challenges, and Applications

Larissa S. Gomes, Gustavo P. Almeida, Bryan U. Moreira, Marco Quiroz, Breno Xavier, Lucas Soares, Stephanie L. Brião, Felipe G. Oliveira, Paulo L. J. Drews-Jr

Main category: cs.CV

TL;DR: This paper provides a comprehensive review of publicly available sonar image datasets across various modalities, analyzing their characteristics and identifying gaps to guide researchers in underwater acoustic data analysis.

DetailsMotivation: The scarcity of publicly available, well-annotated sonar image datasets creates a significant bottleneck for developing robust machine learning models in underwater exploration, autonomous navigation, and ecosystem monitoring.

Method: The authors conducted a systematic review of sonar image datasets across multiple modalities (SSS, FLS, SAS, MBES, DIDSON), analyzed applications (classification, detection, segmentation, 3D reconstruction), and synthesized findings into a master table and chronological timeline.

Result: The review mapped publicly accessible datasets, offering clear comparison of characteristics, sizes, and annotation details, while identifying gaps in the current landscape of sonar image resources.

Conclusion: This work serves as a comprehensive base guide for researchers to start or advance in underwater acoustic data analysis, providing a clear roadmap through systematic cataloging and contextualization of existing sonar image datasets.

Abstract: Sonar images are relevant for advancing underwater exploration, autonomous navigation, and ecosystem monitoring. However, the progress depends on data availability. The scarcity of publicly available, well-annotated sonar image datasets creates a significant bottleneck for the development of robust machine learning models. This paper presents a comprehensive and concise review of the current landscape of sonar image datasets, seeking not only to catalog existing resources but also to contextualize them, identify gaps, and provide a clear roadmap, serving as a base guide for researchers of any kind who wish to start or advance in the field of underwater acoustic data analysis. We mapped publicly accessible datasets across various sonar modalities, including Side Scan Sonar (SSS), Forward-Looking Sonar (FLS), Synthetic Aperture Sonar (SAS), Multibeam Echo Sounder (MBES), and Dual-Frequency Identification Sonar (DIDSON). An analysis was conducted on applications such as classification, detection, segmentation, and 3D reconstruction. This work focuses on state-of-the-art advancements, incorporating newly released datasets. The findings are synthesized into a master table and a chronological timeline, offering a clear and accessible comparison of characteristics, sizes, and annotation details datasets.

[246] Visual Language Model as a Judge for Object Detection in Industrial Diagrams

Sanjukta Ghosh

Main category: cs.CV

TL;DR: A framework using Visual Language Models (VLMs) to automatically evaluate and refine object detection in industrial diagrams like P&IDs, addressing the lack of quality assessment methods.

DetailsMotivation: Industrial diagrams are crucial for digital twins and automation, but there's a gap in automatically evaluating object detection quality despite recent algorithm improvements.

Method: Employ Visual Language Models (VLMs) to assess object detection results by identifying missing or inconsistent detections using multimodal capabilities.

Result: The framework enables automated quality assessment and improves overall detection performance on complex industrial diagrams.

Conclusion: VLMs provide an effective solution for automated quality evaluation and refinement of object detection in industrial diagram digitalization.

Abstract: Industrial diagrams such as piping and instrumentation diagrams (P&IDs) are essential for the design, operation, and maintenance of industrial plants. Converting these diagrams into digital form is an important step toward building digital twins and enabling intelligent industrial automation. A central challenge in this digitalization process is accurate object detection. Although recent advances have significantly improved object detection algorithms, there remains a lack of methods to automatically evaluate the quality of their outputs. This paper addresses this gap by introducing a framework that employs Visual Language Models (VLMs) to assess object detection results and guide their refinement. The approach exploits the multimodal capabilities of VLMs to identify missing or inconsistent detections, thereby enabling automated quality assessment and improving overall detection performance on complex industrial diagrams.

[247] Learned Display Radiance Fields with Lensless Cameras

Ziyang Chen, Yuta Itoh, Kaan Akşit

Main category: cs.CV

TL;DR: A lensless camera and neural algorithm co-design for display calibration without specialized hardware, enabling light field reconstruction from multiple viewpoints.

DetailsMotivation: Display calibration is essential but difficult for most users due to requirements for specialized equipment and dark rooms, making it inaccessible.

Method: Co-designed a lensless camera with Implicit Neural Representation algorithm to capture display characteristics from various viewpoints, enabling light field reconstruction from a 46.6° × 37.6° viewing cone.

Result: The pipeline successfully reconstructs light fields emitted from displays across a wide viewing angle without requiring specialized hardware.

Conclusion: This emerging pipeline represents initial progress toward effortless display calibration and characterization, making the process more accessible to users.

Abstract: Calibrating displays is a basic and regular task that content creators must perform to maintain optimal visual experience, yet it remains a troublesome issue. Measuring display characteristics from different viewpoints often requires specialized equipment and a dark room, making it inaccessible to most users. To avoid specialized hardware requirements in display calibrations, our work co-designs a lensless camera and an Implicit Neural Representation based algorithm for capturing display characteristics from various viewpoints. More specifically, our pipeline enables efficient reconstruction of light fields emitted from a display from a viewing cone of 46.6{\deg} X 37.6{\deg}. Our emerging pipeline paves the initial steps towards effortless display calibration and characterization.

[248] Platonic Transformers: A Solid Choice For Equivariance

Mohammad Mohaiminul Islam, Rishabh Anand, David R. Wessels, Friso de Kruiff, Thijs P. Kuipers, Rex Ying, Clara I. Sánchez, Sharvaree Vadgama, Georg Bökman, Erik J. Bekkers

Main category: cs.CV

TL;DR: The Platonic Transformer introduces geometric inductive biases into Transformers using Platonic solid symmetry groups, achieving equivariance to continuous translations and Platonic symmetries without increasing computational cost.

DetailsMotivation: Transformers lack geometric symmetry biases needed for scientific and computer vision applications, while existing equivariant methods sacrifice efficiency and flexibility.

Method: Defines attention relative to reference frames from Platonic solid symmetry groups, creating a principled weight-sharing scheme that enables combined equivariance to continuous translations and Platonic symmetries.

Result: Achieves competitive performance across computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25) benchmarks while maintaining standard Transformer architecture and computational cost.

Conclusion: The Platonic Transformer resolves the trade-off between geometric equivariance and Transformer efficiency, providing geometric constraints at no additional computational cost.

Abstract: While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

[249] Provenance Networks: End-to-End Exemplar-Based Explainability

Ali Kayyam, Anusha Madan Gopal, M. Anthony Lewis

Main category: cs.CV

TL;DR: Provenance networks are neural models that provide end-to-end explainability by linking predictions directly to supporting training examples, similar to a learned KNN approach.

DetailsMotivation: Address model opaqueness, hallucination, and need for transparency in deep learning by embedding interpretability directly into neural architectures rather than using post-hoc methods.

Method: Joint optimization of primary task and explainability objective where each output is justified by concrete exemplars weighted by relevance in feature space, operating like a learned KNN.

Result: Enables systematic investigation of memorization vs generalization trade-off, verification of training set inclusion, detection of mislabeled data, enhanced resilience to input perturbations, and identification of similar inputs contributing to predictions.

Conclusion: Provenance networks provide complementary explainability approach that improves transparency, robustness, and trustworthiness in neural models, though with computational costs and current scaling limitations to moderately sized datasets.

Abstract: We introduce provenance networks, a novel class of neural models designed to provide end-to-end, training-data-driven explainability. Unlike conventional post-hoc methods, provenance networks learn to link each prediction directly to its supporting training examples as part of the model’s normal operation, embedding interpretability into the architecture itself. Conceptually, the model operates similarly to a learned KNN, where each output is justified by concrete exemplars weighted by relevance in the feature space. This approach facilitates systematic investigations of the trade-off between memorization and generalization, enables verification of whether a given input was included in the training set, aids in the detection of mislabeled or anomalous data points, enhances resilience to input perturbations, and supports the identification of similar inputs contributing to the generation of a new data point. By jointly optimizing the primary task and the explainability objective, provenance networks offer insights into model behavior that traditional deep networks cannot provide. While the model introduces additional computational cost and currently scales to moderately sized datasets, it provides a complementary approach to existing explainability techniques. In particular, it addresses critical challenges in modern deep learning, including model opaqueness, hallucination, and the assignment of credit to data contributors, thereby improving transparency, robustness, and trustworthiness in neural models.

[250] Unsupervised Transformer Pre-Training for Images: Self-Distillation, Mean Teachers, and Random Crops

Mattia Scardecchia

Main category: cs.CV

TL;DR: Survey paper examining DINOv2’s self-supervised learning approach, comparing it with other SSL and weakly supervised methods, and highlighting its emergent properties and limitations.

DetailsMotivation: Recent advances in SSL, particularly DINOv2, have surpassed weakly supervised methods on most benchmarks, prompting a comprehensive examination of its core ideas and performance.

Method: Analyzes DINOv2’s core techniques (multi-crop view augmentation and self-distillation with mean teacher), traces their development in previous work, and compares performance with other SSL and WSL methods across various downstream tasks.

Result: DINOv2 establishes new state-of-the-art performance, surpassing weakly supervised methods like OpenCLIP on most benchmarks, with transformer backbones showing remarkable emergent properties.

Conclusion: Discusses DINOv2’s limitations, its impact on the field, and suggests future research directions for self-supervised learning.

Abstract: Recent advances in self-supervised learning (SSL) have made it possible to learn general-purpose visual features that capture both the high-level semantics and the fine-grained spatial structure of images. Most notably, the recent DINOv2 has established a new state of the art by surpassing weakly supervised methods (WSL) like OpenCLIP on most benchmarks. In this survey, we examine the core ideas behind its approach, multi-crop view augmentation and self-distillation with a mean teacher, and trace their development in previous work. We then compare the performance of DINO and DINOv2 with other SSL and WSL methods across various downstream tasks, and highlight some remarkable emergent properties of their learned features with transformer backbones. We conclude by briefly discussing DINOv2’s limitations, its impact, and future research directions.

[251] Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning

Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu

Main category: cs.CV

TL;DR: SpatialViLT enhances vision-language models by integrating spatial features like depth maps and 3D coordinates through multi-task learning, achieving state-of-the-art spatial reasoning performance.

DetailsMotivation: Current VLMs struggle with spatial reasoning for 3D scenes and complex object configurations, limiting their ability to understand spatial relationships in multimodal data.

Method: Proposed SpatialViLT and MaskedSpatialViLT variants that integrate spatial features (depth maps, 3D coordinates, edge maps) using multi-task learning, with SpatialEnsemble combining both approaches.

Result: Achieved state-of-the-art accuracy on Visual Spatial Reasoning (VSR) dataset, excelling in directional, topological, and proximity relations.

Conclusion: This work significantly advances spatial intelligence in AI systems, which is crucial for advanced multimodal understanding and real-world applications.

Abstract: Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework. This approach enriches multimodal embeddings with spatial understanding. We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively. Additionally, SpatialEnsemble combines both approaches, achieving state-of-the-art accuracy. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset. This work represents a significant step in enhancing the spatial intelligence of AI systems, crucial for advanced multimodal understanding and real-world applications.

[252] SPEGNet: Synergistic Perception-Guided Network for Camouflaged Object Detection

Baber Jan, Saeed Anwar, Aiman H. El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais

Main category: cs.CV

TL;DR: SPEGNet is a unified architecture for camouflaged object detection that integrates multi-scale features through channel calibration and spatial enhancement, achieving state-of-the-art performance with real-time inference speed.

DetailsMotivation: Current camouflaged object detection methods accumulate complex components like boundary modules and attention mechanisms, creating computational burden without proportional gains and losing fine details due to reduced resolution processing.

Method: SPEGNet uses a unified design with multi-scale feature integration via channel calibration and spatial enhancement. Boundaries emerge from context-rich representations with progressive refinement using scale-adaptive edge modulation at intermediate resolutions.

Result: Achieves 0.887 Sα on CAMO, 0.890 on COD10K, and 0.895 on NC4K datasets with real-time inference speed. Excels across scales from tiny intricate objects to large pattern-similar ones while handling occlusion and ambiguous boundaries.

Conclusion: The unified approach strikes a balance between boundary precision and regional consistency, outperforming complex accumulated component methods while maintaining computational efficiency.

Abstract: Camouflaged object detection segments objects with intrinsic similarity and edge disruption. Current detection methods rely on accumulated complex components. Each approach adds components such as boundary modules, attention mechanisms, and multi-scale processors independently. This accumulation creates a computational burden without proportional gains. To manage this complexity, they process at reduced resolutions, eliminating fine details essential for camouflage. We present SPEGNet, addressing fragmentation through a unified design. The architecture integrates multi-scale features via channel calibration and spatial enhancement. Boundaries emerge directly from context-rich representations, maintaining semantic-spatial alignment. Progressive refinement implements scale-adaptive edge modulation with peak influence at intermediate resolutions. This design strikes a balance between boundary precision and regional consistency. SPEGNet achieves 0.887 $S_\alpha$ on CAMO, 0.890 on COD10K, and 0.895 on NC4K, with real-time inference speed. Our approach excels across scales, from tiny, intricate objects to large, pattern-similar ones, while handling occlusion and ambiguous boundaries. Code, model weights, and results are available on \href{https://github.com/Baber-Jan/SPEGNet}{https://github.com/Baber-Jan/SPEGNet}.

[253] Denoising of Two-Phase Optically Sectioned Structured Illumination Reconstructions Using Encoder-Decoder Networks

Allison Davis, Yezhi Shen, Xiaoyu Ji, Fengqing Zhu

Main category: cs.CV

TL;DR: Encoder-decoder networks trained on synthetic data effectively reduce artifacts in two-phase optical-sectioning structured illumination microscopy, improving image clarity without requiring clean ground-truth data.

DetailsMotivation: Two-phase optical-sectioning SI suffers from residual artifacts due to reduced acquisition time, and conventional denoising methods struggle with these artifacts. Supervised deep learning is limited by the lack of clean ground-truth data.

Method: Used encoder-decoder networks (asymmetrical denoising autoencoder and U-Net) trained on synthetic pairs created by applying real artifact fields to synthetic images, then evaluated on real OS-SI images.

Result: Both networks improved image clarity, with each network excelling against different types of artifacts. The approach successfully enabled supervised denoising of OS-SI images.

Conclusion: Synthetic training enables effective supervised denoising of OS-SI images, and encoder-decoder networks show potential for streamlining reconstruction workflows in structured illumination microscopy.

Abstract: Structured illumination (SI) enhances image resolution and contrast by projecting patterned light onto a sample. In two-phase optical-sectioning SI (OS-SI), reduced acquisition time introduces residual artifacts that conventional denoising struggles to suppress. Deep learning offers an alternative to traditional methods; however, supervised training is limited by the lack of clean, optically sectioned ground-truth data. We investigate encoder-decoder networks for artifact reduction in two-phase OS-SI, using synthetic training pairs formed by applying real artifact fields to synthetic images. An asymmetrical denoising autoencoder (DAE) and a U-Net are trained on the synthetic data, then evaluated on real OS-SI images. Both networks improve image clarity, with each excelling against different artifact types. These results demonstrate that synthetic training enables supervised denoising of OS-SI images and highlight the potential of encoder-decoder networks to streamline reconstruction workflows.

[254] PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology

Sejuti Majumder, Saarthak Kapse, Moinak Bhattacharya, Xuan Xu, Alisa Yurovsky, Prateek Prasanna

Main category: cs.CV

TL;DR: PEaRL is a multimodal framework that integrates histopathology with spatial transcriptomics using pathway activation scores instead of individual genes, improving prediction accuracy and biological interpretability.

DetailsMotivation: Existing multimodal approaches rely on limited sets of highly variable genes, which restricts predictive scope and misses coordinated biological programs that shape tissue phenotypes.

Method: PEaRL represents transcriptomics through pathway activation scores computed with ssGSEA, encodes pathway signals with a transformer, and aligns them with histology features via contrastive learning.

Result: Across three cancer ST datasets (breast, skin, lymph node), PEaRL outperforms SOTA methods with up to 58.9% and 20.4% increase in Pearson correlation for gene- and pathway-level expression prediction respectively.

Conclusion: Grounding transcriptomic representation in pathways produces more biologically faithful and interpretable multimodal models, advancing computational pathology beyond gene-level embeddings.

Abstract: Integrating histopathology with spatial transcriptomics (ST) provides a powerful opportunity to link tissue morphology with molecular function. Yet most existing multimodal approaches rely on a small set of highly variable genes, which limits predictive scope and overlooks the coordinated biological programs that shape tissue phenotypes. We present PEaRL (Pathway Enhanced Representation Learning), a multimodal framework that represents transcriptomics through pathway activation scores computed with ssGSEA. By encoding biologically coherent pathway signals with a transformer and aligning them with histology features via contrastive learning, PEaRL reduces dimensionality, improves interpretability, and strengthens cross-modal correspondence. Across three cancer ST datasets (breast, skin, and lymph node), PEaRL consistently outperforms SOTA methods, yielding higher accuracy for both gene- and pathway-level expression prediction (up to 58.9 percent and 20.4 percent increase in Pearson correlation coefficient compared to SOTA). These results demonstrate that grounding transcriptomic representation in pathways produces more biologically faithful and interpretable multimodal models, advancing computational pathology beyond gene-level embeddings.

[255] DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis

Numan Saeed, Tausifa Jan Saleem, Fadillah Maani, Muhammad Ridzuan, Hu Wang, Mohammad Yaqub

Main category: cs.CV

TL;DR: DuPLUS is a universal vision-language framework for medical imaging that uses hierarchical semantic prompts and dual-prompt mechanisms to enable fine-grained control over analysis tasks, outperforming state-of-the-art models on 8/10 datasets and achieving CI=0.69 for prognosis prediction.

DetailsMotivation: Address limitations of task-specific models lacking generalizability and existing universal approaches with simplistic conditioning and poor medical semantic understanding.

Method: Introduces hierarchical semantic prompts for fine-grained control, dual-prompt mechanism for text-controlled architecture, and parameter-efficient fine-tuning for rapid adaptation to new tasks and modalities.

Result: Outperforms state-of-the-art models on 8/10 datasets across 3 imaging modalities and 30+ organs/tumors, achieves CI=0.69 for prognosis prediction with EHR integration, and enables rapid adaptation to new tasks.

Conclusion: DuPLUS establishes itself as a versatile and clinically relevant solution for medical image analysis with strong generalizability and extensibility capabilities.

Abstract: Deep learning for medical imaging is hampered by task-specific models that lack generalizability and prognostic capabilities, while existing ‘universal’ approaches suffer from simplistic conditioning and poor medical semantic understanding. To address these limitations, we introduce DuPLUS, a deep learning framework for efficient multi-modal medical image analysis. DuPLUS introduces a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task, a capability absent in prior universal models. To enable extensibility to other medical tasks, it includes a hierarchical, text-controlled architecture driven by a unique dual-prompt mechanism. For segmentation, DuPLUS is able to generalize across three imaging modalities, ten different anatomically various medical datasets, encompassing more than 30 organs and tumor types. It outperforms the state-of-the-art task specific and universal models on 8 out of 10 datasets. We demonstrate extensibility of its text-controlled architecture by seamless integration of electronic health record (EHR) data for prognosis prediction, and on a head and neck cancer dataset, DuPLUS achieved a Concordance Index (CI) of 0.69. Parameter-efficient fine-tuning enables rapid adaptation to new tasks and modalities from varying centers, establishing DuPLUS as a versatile and clinically relevant solution for medical image analysis. The code for this work is made available at: https://anonymous.4open.science/r/DuPLUS-6C52

[256] Real-Time Threaded Houbara Detection and Segmentation for Wildlife Conservation using Mobile Platforms

Lyes Saad Saoud, Loic Lesobre, Enrico Sorato, Irfan Hussain

Main category: cs.CV

TL;DR: A mobile-optimized two-stage deep learning framework for real-time animal detection and segmentation in natural environments, using parallelized YOLOv10 detection and MobileSAM segmentation with threading to reduce latency.

DetailsMotivation: Real-time animal detection and segmentation are vital for wildlife conservation through non-invasive monitoring, but remain challenging due to limited computational resources and cryptic species appearance.

Method: Two-stage framework integrating Threading Detection Model (TDM) to parallelize YOLOv10-based detection and MobileSAM-based segmentation, executed concurrently for efficient resource use.

Result: Achieved mAP50 of 0.9627, mAP75 of 0.7731, mAP95 of 0.7178, and MobileSAM mIoU of 0.7421 on Houbara Bustard. YOLOv10 operates at 43.7 ms per frame, confirming real-time readiness.

Conclusion: The proposed framework successfully enables real-time animal detection and segmentation with high accuracy, supported by a curated dataset of 40,000 annotated Houbara images, making it suitable for conservation monitoring applications.

Abstract: Real-time animal detection and segmentation in natural environments are vital for wildlife conservation, enabling non-invasive monitoring through remote camera streams. However, these tasks remain challenging due to limited computational resources and the cryptic appearance of many species. We propose a mobile-optimized two-stage deep learning framework that integrates a Threading Detection Model (TDM) to parallelize YOLOv10-based detection and MobileSAM-based segmentation. Unlike prior YOLO+SAM pipelines, our approach improves real-time performance by reducing latency through threading. YOLOv10 handles detection while MobileSAM performs lightweight segmentation, both executed concurrently for efficient resource use. On the cryptic Houbara Bustard, a conservation-priority species, our model achieves mAP50 of 0.9627, mAP75 of 0.7731, mAP95 of 0.7178, and a MobileSAM mIoU of 0.7421. YOLOv10 operates at 43.7 ms per frame, confirming real-time readiness. We introduce a curated Houbara dataset of 40,000 annotated images to support model training and evaluation across diverse conditions. The code and dataset used in this study are publicly available on GitHub at https://github.com/LyesSaadSaoud/mobile-houbara-detseg. For interactive demos and additional resources, visit https://lyessaadsaoud.github.io/LyesSaadSaoud-Threaded-YOLO-SAM-Houbara.

[257] Domain Generalization for Semantic Segmentation: A Survey

Manuel Schwonberg, Hanno Gottschalk

Main category: cs.CV

TL;DR: This survey paper provides a comprehensive overview of domain generalized semantic segmentation, highlighting the paradigm shift towards foundation-model-based approaches and their significant impact on performance.

DetailsMotivation: Deep neural networks face challenges in generalizing to unknown domains, particularly in semantic segmentation tasks used in critical areas like biomedicine and automated driving. Domain generalization addresses this by enabling models to work across multiple unseen target domains without access to target domain data.

Method: The survey clusters and reviews existing domain generalization approaches for semantic segmentation, with a focus on identifying the emerging trend towards foundation-model-based methods.

Result: The paper provides extensive performance comparisons that demonstrate the significant influence of foundation models on improving domain generalization capabilities in semantic segmentation tasks.

Conclusion: This survey aims to advance domain generalization research and inspire exploration of new research directions, particularly highlighting the transformative impact of foundation models in this rapidly evolving field.

Abstract: The generalization of deep neural networks to unknown domains is a major challenge despite their tremendous progress in recent years. For this reason, the dynamic area of domain generalization (DG) has emerged. In contrast to unsupervised domain adaptation, there is no access to or knowledge about the target domains, and DG methods aim to generalize across multiple different unseen target domains. Domain generalization is particularly relevant for the task semantic segmentation which is used in several areas such as biomedicine or automated driving. This survey provides a comprehensive overview of the rapidly evolving topic of domain generalized semantic segmentation. We cluster and review existing approaches and identify the paradigm shift towards foundation-model-based domain generalization. Finally, we provide an extensive performance comparison of all approaches, which highlights the significant influence of foundation models on domain generalization. This survey seeks to advance domain generalization research and inspire scientists to explore new research directions.

[258] From Scope to Script: An Automated Report Generation Model for Gastrointestinal Endoscopy

Evandros Kaklamanos, Kristjana Kristinsdottir, Jonathan Huang, Dustin Carlson, Rajesh Keswani, John Pandolfino, Mozziyar Etemadi

Main category: cs.CV

TL;DR: A transformer-based automated report generation model for endoscopic procedures that reduces documentation burden on gastroenterologists.

DetailsMotivation: Endoscopic procedures create significant documentation burden leading to inefficiencies and physician burnout in gastroenterology workflows.

Method: Two-stage training framework: pre-training transformer-based vision encoder and text decoder on image/caption pairs, then fine-tuning on images/report pairs for clinical findings.

Result: The model generates clinically meaningful findings from endoscopic images, streamlining documentation process.

Conclusion: This approach reduces physician workload and improves patient care by automating report generation for endoscopic procedures.

Abstract: Endoscopic procedures such as esophagogastroduodenoscopy (EGD) and colonoscopy play a critical role in diagnosing and managing gastrointestinal (GI) disorders. However, the documentation burden associated with these procedures place significant strain on gastroenterologists, contributing to inefficiencies in clinical workflows and physician burnout. To address this challenge, we propose a novel automated report generation model that leverages a transformer-based vision encoder and text decoder within a two-stage training framework. In the first stage, both components are pre-trained on image/text caption pairs to capture generalized vision-language features, followed by fine-tuning on images/report pairs to generate clinically meaningful findings. Our approach not only streamlines the documentation process but also holds promise for reducing physician workload and improving patient care.

[259] SketchPlan: Diffusion Based Drone Planning From Human Sketches

Sixten Norelius, Aaron O. Feldman, Mac Schwager

Main category: cs.CV

TL;DR: SketchPlan is a diffusion-based system that converts 2D hand-drawn sketches over depth images into 3D drone flight paths, achieving zero-shot sim-to-real transfer with high success rates in real-world navigation.

DetailsMotivation: To enable intuitive drone navigation through hand-drawn sketches while addressing the gap between ideal 2D projections and real human sketches, and achieving robust performance in unseen real-world environments.

Method: Uses a two-component system: SketchAdapter maps human sketches to 2D paths, and DiffPath (diffusion model) generates 3D trajectories from 2D projections and depth images. Trained on 32k synthetic flight paths with photorealistic 3D Gaussian Splatting scenes, plus 872 human-labeled sketches.

Result: Achieved 100% success in low/medium clutter and 40% in high-clutter real-world environments, outperforming ablations by 20-60% in task completion. Demonstrated effective zero-shot sim-to-real transfer.

Conclusion: SketchPlan successfully bridges the gap between human sketches and 3D drone navigation, with modular design and mixed training data significantly improving interpretation of human intent and 3D path inference capabilities.

Abstract: We propose SketchPlan, a diffusion-based planner that interprets 2D hand-drawn sketches over depth images to generate 3D flight paths for drone navigation. SketchPlan comprises two components: a SketchAdapter that learns to map the human sketches to projected 2D paths, and DiffPath, a diffusion model that infers 3D trajectories from 2D projections and a first person view depth image. Our model achieves zero-shot sim-to-real transfer, generating accurate and safe flight paths in previously unseen real-world environments. To train the model, we build a synthetic dataset of 32k flight paths using a diverse set of photorealistic 3D Gaussian Splatting scenes. We automatically label the data by computing 2D projections of the 3D flight paths onto the camera plane, and use this to train the DiffPath diffusion model. However, since real human 2D sketches differ significantly from ideal 2D projections, we additionally label 872 of the 3D flight paths with real human sketches and use this to train the SketchAdapter to infer the 2D projection from the human sketch. We demonstrate SketchPlan’s effectiveness in both simulated and real-world experiments, and show through ablations that training on a mix of human labeled and auto-labeled data together with a modular design significantly boosts its capabilities to correctly interpret human intent and infer 3D paths. In real-world drone tests, SketchPlan achieved 100% success in low/medium clutter and 40% in unseen high-clutter environments, outperforming key ablations by 20-60% in task completion.

[260] Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm

Main category: cs.CV

TL;DR: A biometric-based defense method for detecting identity hijacking in AI-based talking-head videoconferencing systems by analyzing pose-expression latents instead of reconstructed RGB video.

DetailsMotivation: AI talking-head systems use compact pose-expression latents that can be puppeteered to hijack identities, and existing deepfake detectors fail because every frame is synthetic.

Method: A pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues from pose-expression latents while canceling transient pose and expression information.

Result: The method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios across multiple talking-head generation models.

Conclusion: The proposed biometric leakage defense effectively detects identity swaps by analyzing pose-expression latents directly, providing real-time protection against puppeteering attacks in talking-head systems.

Abstract: AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim’s likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.

[261] Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!

Junbao Zhou, Yuan Zhou, Kesen Zhao, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang

Main category: cs.CV

TL;DR: REVEL enables interactive drag-based video manipulation anytime on anything, addressing challenges of latent distribution drift and context interference through DragStream’s training-free approach.

DetailsMotivation: Current autoregressive video diffusion models lack fine-grained streaming control, making it difficult to ensure outputs consistently align with user expectations through interactive manipulation.

Method: Proposes DragStream with: 1) adaptive distribution self-rectification using neighboring frames’ statistics to constrain latent embedding drift; 2) spatial-frequency selective optimization to exploit contextual information while mitigating interference.

Result: DragStream can be seamlessly integrated into existing autoregressive video diffusion models and effectively enables streaming drag-oriented interactive video manipulation.

Conclusion: The proposed DragStream approach successfully resolves the REVEL task, providing versatile drag operations for video editing and animation with translation, deformation, and rotation effects.

Abstract: Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose \textbf{stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL)}, a new task that enables users to modify generated videos \emph{anytime} on \emph{anything} via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames with both supporting user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: \emph{i}) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; \emph{ii}) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, \textbf{DragStream}, comprising: \emph{i}) an adaptive distribution self-rectification strategy that leverages neighboring frames’ statistics to effectively constrain the drift of latent embeddings; \emph{ii}) a spatial-frequency selective optimization mechanism, allowing the model to fully exploit contextual information while mitigating its interference via selectively propagating visual cues along generation. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments firmly demonstrate the effectiveness of our DragStream.

[262] GAS-MIL: Group-Aggregative Selection Multi-Instance Learning for Ensemble of Foundation Models in Digital Pathology Image Analysis

Peiran Quan, Zifan Gu, Zhuo Zhao, Qin Zhou, Donghan M. Yang, Ruichen Rong, Yang Xie, Guanghua Xiao

Main category: cs.CV

TL;DR: GAS-MIL is an ensemble framework that integrates features from multiple foundation models for computational pathology, achieving superior performance across cancer datasets without requiring manual feature selection or extensive fine-tuning.

DetailsMotivation: Adapting and benchmarking individual foundation models for specific diagnostic tasks in computational pathology is time-consuming and resource-intensive due to their scale and diversity.

Method: Group-Aggregative Selection Multi-Instance Learning (GAS-MIL) - a flexible ensemble framework that seamlessly integrates features from multiple foundation models while preserving their complementary strengths.

Result: Across three cancer datasets (prostate, ovarian, breast), GAS-MIL consistently achieves superior or on-par performance relative to individual foundation models and established MIL methods.

Conclusion: GAS-MIL enables efficient integration of heterogeneous foundation models, streamlining model deployment for pathology and providing a scalable foundation for future multimodal and precision oncology applications.

Abstract: Foundation models (FMs) have transformed computational pathology by providing powerful, general-purpose feature extractors. However, adapting and benchmarking individual FMs for specific diagnostic tasks is often time-consuming and resource-intensive, especially given their scale and diversity. To address this challenge, we introduce Group-Aggregative Selection Multi-Instance Learning (GAS-MIL), a flexible ensemble framework that seamlessly integrates features from multiple FMs, preserving their complementary strengths without requiring manual feature selection or extensive task-specific fine-tuning. Across classification tasks in three cancer datasets-prostate (PANDA), ovarian (UBC-OCEAN), and breast (TCGA-BrCa)-GAS-MIL consistently achieves superior or on-par performance relative to individual FMs and established MIL methods, demonstrating its robustness and generalizability. By enabling efficient integration of heterogeneous FMs, GAS-MIL streamlines model deployment for pathology and provides a scalable foundation for future multimodal and precision oncology applications.

[263] Real-Time Assessment of Bystander Situation Awareness in Drone-Assisted First Aid

Shen Chang, Renran Tian, Nicole Adams, Nan Kong

Main category: cs.CV

TL;DR: This paper introduces a drone-assisted naloxone delivery simulation dataset and a real-time situational awareness assessment framework using graph embeddings and transformers to help bystanders respond to opioid overdoses before EMS arrival.

DetailsMotivation: To address the critical need for real-time situational awareness assessment in human-autonomy teaming for opioid overdose emergencies, where bystander awareness is crucial for effective drone-assisted naloxone delivery.

Method: Created the Drone-Assisted Naloxone Delivery Simulation Dataset (DANDSD) with college students as bystanders, then developed a video-based real-time SA assessment framework using graph embeddings and transformer models that integrate visual perception and comprehension cues.

Result: The framework achieves high-performance SA prediction with strong temporal segmentation, outperforming FINCH baseline by 9% in Mean over Frames (MoF) and 5% in Intersection over Union (IoU).

Conclusion: The work enables development of adaptive drone systems that can effectively guide bystanders during opioid overdose emergencies, potentially improving emergency response outcomes and saving lives.

Abstract: Rapid naloxone delivery via drones offers a promising solution for responding to opioid overdose emergencies (OOEs), by extending lifesaving interventions to medically untrained bystanders before emergency medical services (EMS) arrive. Recognizing the critical role of bystander situational awareness (SA) in human-autonomy teaming (HAT), we address a key research gap in real-time SA assessment by introducing the Drone-Assisted Naloxone Delivery Simulation Dataset (DANDSD). This pioneering dataset captures HAT during simulated OOEs, where college students without medical training act as bystanders tasked with administering intranasal naloxone to a mock overdose victim. Leveraging this dataset, we propose a video-based real-time SA assessment framework that utilizes graph embeddings and transformer models to assess bystander SA in real time. Our approach integrates visual perception and comprehension cues–such as geometric, kinematic, and interaction graph features–and achieves high-performance SA prediction. It also demonstrates strong temporal segmentation accuracy, outperforming the FINCH baseline by 9% in Mean over Frames (MoF) and 5% in Intersection over Union (IoU). This work supports the development of adaptive drone systems capable of guiding bystanders effectively, ultimately improving emergency response outcomes and saving lives.

[264] Evaluating OCR performance on food packaging labels in South Africa

Mayimunah Nagayi, Alice Khan, Tamryn Frank, Rina Swart, Clement Nyirenda

Main category: cs.CV

TL;DR: Evaluation of four OCR systems (Tesseract, EasyOCR, PaddleOCR, TrOCR) on food packaging images shows Tesseract has best accuracy, EasyOCR offers good multilingual balance, PaddleOCR has high coverage but is slow, while TrOCR performs worst despite GPU acceleration.

DetailsMotivation: Accurate OCR for food packaging is important for compliance and nutrition monitoring, but challenging due to multilingual text, dense layouts, varied fonts, glare, and curved surfaces.

Method: Used dataset of 231 products (1,628 images) processed by four OCR models, with ground truth subset of 113 images (60 products) for accuracy evaluation using metrics including CER, WER, BLEU, ROUGE-L, F1, coverage, and execution time.

Result: Tesseract achieved lowest CER (0.912) and highest BLEU (0.245). EasyOCR provided good balance between accuracy and multilingual support. PaddleOCR achieved near complete coverage but was slower. TrOCR produced weakest results despite GPU acceleration.

Conclusion: Results provide packaging-specific benchmark, establish baseline, and highlight directions for layout-aware methods and text localization.

Abstract: This study evaluates four open-source Optical Character Recognition (OCR) systems which are Tesseract, EasyOCR, PaddleOCR, and TrOCR on real world food packaging images. The aim is to assess their ability to extract ingredient lists and nutrition facts panels. Accurate OCR for packaging is important for compliance and nutrition monitoring but is challenging due to multilingual text, dense layouts, varied fonts, glare, and curved surfaces. A dataset of 231 products (1,628 images) was processed by all four models to assess speed and coverage, and a ground truth subset of 113 images (60 products) was created for accuracy evaluation. Metrics include Character Error Rate (CER), Word Error Rate (WER), BLEU, ROUGE-L, F1, coverage, and execution time. On the ground truth subset, Tesseract achieved the lowest CER (0.912) and the highest BLEU (0.245). EasyOCR provided a good balance between accuracy and multilingual support. PaddleOCR achieved near complete coverage but was slower because it ran on CPU only due to GPU incompatibility, and TrOCR produced the weakest results despite GPU acceleration. These results provide a packaging-specific benchmark, establish a baseline, and highlight directions for layout-aware methods and text localization.

[265] FrameOracle: Learning What to See and How Much to See in Videos

Chaoyu Li, Tianzhi Li, Fei Tao, Zhenyu Zhao, Ziqian Wu, Maozheng Zhao, Juntong Song, Cheng Niu, Pooyan Fazli

Main category: cs.CV

TL;DR: FrameOracle is a lightweight plug-and-play module that predicts which frames are most relevant to a query and how many frames are needed, reducing input frames by 35-78% while maintaining or improving accuracy.

DetailsMotivation: Existing frame sampling strategies fail to adapt to variations in information density or task complexity, leading to inefficiency and information loss in video understanding.

Method: FrameOracle uses a four-stage curriculum training approach with weak proxy signals initially, then leverages FrameOracle-41K dataset with keyframe annotations for stronger supervision.

Result: Reduces 16-frame inputs to 10.4 frames without accuracy loss, and 64-frame inputs to 13.9 frames with 1.4% accuracy improvement across five VLMs and six benchmarks.

Conclusion: FrameOracle achieves state-of-the-art efficiency-accuracy trade-offs for scalable video understanding by adaptively selecting optimal frames.

Abstract: Vision-language models (VLMs) have advanced video understanding, but their performance is limited by the number of input frames they can process. Existing frame sampling strategies, such as uniform or fixed-budget selection, often fail to adapt to variations in information density or task complexity, resulting in inefficiency and information loss. To address this, we present FrameOracle, a lightweight and plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained using a four-stage curriculum, with the first three stages relying on weak proxy signals such as cross-modal similarity. In the final stage, it leverages stronger supervision from a new dataset we introduce, FrameOracle-41K, the first large-scale VideoQA collection to provide keyframe annotations specifying the minimal set of frames required to answer each question. Extensive experiments across five VLMs and six benchmarks demonstrate that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without any loss in accuracy. When starting from 64-frame candidates, it reduces the input to an average of 13.9 frames while improving accuracy by 1.4%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.

[266] A Hybrid Co-Finetuning Approach for Visual Bug Detection in Video Games

Faliu Yi, Sherif Abdelfattah, Wei Huang, Adrian Brown

Main category: cs.CV

TL;DR: Proposes a hybrid Co-FineTuning (CFT) method for visual bug detection in video games that combines labeled data from target and co-domain games with unlabeled data to reduce dependency on extensive labeled datasets.

DetailsMotivation: Manual visual bug detection in games is resource-intensive and costly, while supervised models require extensive labeled datasets which are challenging due to infrequent bug occurrences.

Method: Hybrid Co-FineTuning (CFT) that integrates labeled samples from target game and co-domain games with unlabeled data to enhance feature representation learning.

Result: Demonstrates superior performance compared to conventional baselines across multiple gaming environments, maintaining competitive performance with only 50% of labeled data from target game.

Conclusion: CFT method provides enhanced scalability and adaptability for efficient visual bug detection across various game titles while reducing dependency on labeled examples.

Abstract: Manual identification of visual bugs in video games is a resource-intensive and costly process, often demanding specialized domain knowledge. While supervised visual bug detection models offer a promising solution, their reliance on extensive labeled datasets presents a significant challenge due to the infrequent occurrence of such bugs. To overcome this limitation, we propose a hybrid Co-FineTuning (CFT) method that effectively integrates both labeled and unlabeled data. Our approach leverages labeled samples from the target game and diverse co-domain games, additionally incorporating unlabeled data to enhance feature representation learning. This strategy maximizes the utility of all available data, substantially reducing the dependency on labeled examples from the specific target game. The developed framework demonstrates enhanced scalability and adaptability, facilitating efficient visual bug detection across various game titles. Our experimental results show the robustness of the proposed method for game visual bug detection, exhibiting superior performance compared to conventional baselines across multiple gaming environments. Furthermore, CFT maintains competitive performance even when trained with only 50% of the labeled data from the target game.

[267] Exploring the Hierarchical Reasoning Model for Small Natural-Image Classification Without Augmentation

Alexander V. Mantzaris

Main category: cs.CV

TL;DR: HRM with Transformer modules performs well on MNIST but overfits and generalizes poorly on CIFAR datasets compared to simple CNNs, showing insufficient inductive bias for natural images.

DetailsMotivation: To test if the Hierarchical Reasoning Model (HRM) with Transformer-style modules can serve as a practical image classifier under raw training conditions without data augmentation.

Method: HRM with two Transformer modules, DEQ-style one-step training, deep supervision, Rotary Position Embeddings, and RMSNorm. Evaluated on MNIST, CIFAR-10, CIFAR-100 with no data augmentation, identical optimizer with one-epoch warmup and cosine-floor decay, and label smoothing.

Result: HRM achieves ~98% on MNIST but performs poorly on natural images: 65.0% on CIFAR-10 vs 77.2% for CNN baseline, and 29.7% on CIFAR-100 vs 45.3% for CNN. HRM trains ~30x slower per epoch and shows significant overfitting.

Conclusion: HRM is not competitive with simple convolutional architectures for small-resolution image classification without augmentation due to insufficient image-specific inductive bias, though modifications could potentially improve it.

Abstract: This paper asks whether the Hierarchical Reasoning Model (HRM) with the two Transformer-style modules $(f_L,f_H)$, one step (DEQ-style) training, deep supervision, Rotary Position Embeddings, and RMSNorm can serve as a practical image classifier. It is evaluated on MNIST, CIFAR-10, and CIFAR-100 under a deliberately raw regime: no data augmentation, identical optimizer family with one-epoch warmup then cosine-floor decay, and label smoothing. HRM optimizes stably and performs well on MNIST ($\approx 98%$ test accuracy), but on small natural images it overfits and generalizes poorly: on CIFAR-10, HRM reaches 65.0% after 25 epochs, whereas a two-stage Conv–BN–ReLU baseline attains 77.2% while training $\sim 30\times$ faster per epoch; on CIFAR-100, HRM achieves only 29.7% test accuracy despite 91.5% train accuracy, while the same CNN reaches 45.3% test with 50.5% train accuracy. Loss traces and error analyses indicate healthy optimization but insufficient image-specific inductive bias for HRM in this regime. It is concluded that, for small-resolution image classification without augmentation, HRM is not competitive with even simple convolutional architectures as the HRM currently exist but this does not exclude possibilities that modifications to the model may allow it to improve greatly.

[268] Diffusion-Classifier Synergy: Reward-Aligned Learning via Mutual Boosting Loop for FSCIL

Ruitao Wu, Yifan Zhao, Guangyao Chen, Jia Li

Main category: cs.CV

TL;DR: DCS introduces a mutual boosting loop between diffusion models and FSCIL classifiers using reward-aligned learning, achieving state-of-the-art performance on FSCIL benchmarks by enhancing knowledge retention and new class learning.

DetailsMotivation: Address the challenges of Few-Shot Class-Incremental Learning (FSCIL) where models struggle with generalization due to limited datasets and the stability-plasticity dilemma, while overcoming semantic misalignment issues in direct diffusion model applications.

Method: Proposes Diffusion-Classifier Synergy (DCS) with dynamic multi-faceted reward function at feature level (prototype-anchored maximum mean discrepancy, dimension-wise variance matching) and logits level (confidence recalibration, cross-session confusion-aware mechanisms) to guide diffusion model generation.

Result: Demonstrably achieves state-of-the-art performance on FSCIL benchmarks, significantly enhancing both knowledge retention and new class learning capabilities.

Conclusion: The co-evolutionary process between diffusion model and classifier through reward-aligned learning effectively addresses FSCIL challenges, providing a robust framework for few-shot incremental learning.

Abstract: Few-Shot Class-Incremental Learning (FSCIL) challenges models to sequentially learn new classes from minimal examples without forgetting prior knowledge, a task complicated by the stability-plasticity dilemma and data scarcity. Current FSCIL methods often struggle with generalization due to their reliance on limited datasets. While diffusion models offer a path for data augmentation, their direct application can lead to semantic misalignment or ineffective guidance. This paper introduces Diffusion-Classifier Synergy (DCS), a novel framework that establishes a mutual boosting loop between diffusion model and FSCIL classifier. DCS utilizes a reward-aligned learning strategy, where a dynamic, multi-faceted reward function derived from the classifier’s state directs the diffusion model. This reward system operates at two levels: the feature level ensures semantic coherence and diversity using prototype-anchored maximum mean discrepancy and dimension-wise variance matching, while the logits level promotes exploratory image generation and enhances inter-class discriminability through confidence recalibration and cross-session confusion-aware mechanisms. This co-evolutionary process, where generated images refine the classifier and an improved classifier state yields better reward signals, demonstrably achieves state-of-the-art performance on FSCIL benchmarks, significantly enhancing both knowledge retention and new class learning.

[269] MonitorVLM:A Vision Language Framework for Safety Violation Detection in Mining Operations

Jiang Wu, Sichao Wu, Yinsong Ma, Guangyuan Yu, Haoyuan Xu, Lifang Zheng, Jingliang Duan

Main category: cs.CV

TL;DR: MonitorVLM is a vision-language framework that automatically detects safety violations in mining surveillance videos using domain-specific VQA datasets, clause filtering for efficiency, and behavior magnification for better action recognition.

DetailsMotivation: Traditional manual safety inspection in mining is labor-intensive, error-prone, and inadequate for large-scale dynamic environments, creating an urgent need for automated intelligent safety monitoring systems.

Method: Uses a vision-language framework with three innovations: domain-specific VQA dataset (9,000 samples across 40 regulations), clause filter module for dynamic selection of relevant clauses, and behavior magnifier module that enhances worker regions for fine-grained action recognition.

Result: Significantly outperforms baseline models with 22.01% precision, 34.22% recall, and 28.37% F1 score improvements over 72B unfine-tuned baseline. Clause filter reduces inference latency by 13.56% while maintaining accuracy, and behavior magnifier adds 3.45% precision and 8.62% recall gains.

Conclusion: Demonstrates the potential of multimodal large models to enhance occupational safety monitoring in mining and other high-risk domains, with practical integration through a lightweight web-based interface for automatic violation reporting.

Abstract: Industrial accidents, particularly in high-risk domains such as surface and underground mining, are frequently caused by unsafe worker behaviors. Traditional manual inspection remains labor-intensive, error-prone, and insufficient for large-scale, dynamic environments, highlighting the urgent need for intelligent and automated safety monitoring. In this paper, we present MonitorVLM, a novel vision–language framework designed to detect safety violations directly from surveillance video streams. MonitorVLM introduces three key innovations: (1) a domain-specific violation dataset comprising 9,000 vision–question–answer (VQA) samples across 40 high-frequency mining regulations, enriched with augmentation and auxiliary detection cues; (2) a clause filter (CF) module that dynamically selects the Top-$K$ most relevant clauses, reducing inference latency by 13.56% while maintaining accuracy; and (3) a behavior magnifier (BM) module that enhances worker regions to improve fine-grained action recognition, yielding additional gains of 3.45% in precision and 8.62% in recall. Experimental results demonstrate that MonitorVLM significantly outperforms baseline vision–language models, achieving improvements of 22.01% in precision, 34.22% in recall, and 28.37% in F1 score over the 72B unfine-tuned baseline. A lightweight web-based interface further integrates MonitorVLM into practical workflows, enabling automatic violation reporting with video timestamping. This study highlights the potential of multimodal large models to enhance occupational safety monitoring in mining and beyond.

[270] A Novel Cloud-Based Diffusion-Guided Hybrid Model for High-Accuracy Accident Detection in Intelligent Transportation Systems

Siva Sai, Saksham Gupta, Vinay Chamola, Rajkumar Buyya

Main category: cs.CV

TL;DR: A novel hybrid model combining guidance classification with diffusion techniques for accident detection in ITS, achieving 97.32% accuracy through cloud-based implementation and conditional modules.

DetailsMotivation: To overcome shortcomings of conventional classification approaches in ITS by leveraging diffusion models' capacity to understand complex data distributions for improved accident detection.

Method: Hybrid model integrating fine-tuned ExceptionNet architecture outputs as input for diffusion model, using image tensors as conditioning. Features multiple conditional modules with time embeddings and image covariate embeddings to dynamically adapt network behavior during diffusion process. Implemented as cloud-based solution for scalability.

Result: Achieved 97.32% accuracy in image-based accident detection, outperforming baseline models. Comprehensive ablation study investigated diffusion characteristics including timestep schedulers, encoding techniques, timestep count, and architectural design changes.

Conclusion: The proposed diffusion model demonstrates superior performance in accident detection for ITS, providing a robust classification framework that effectively handles complex data distributions through conditional modulation and cloud-based scalability.

Abstract: The integration of Diffusion Models into Intelligent Transportation Systems (ITS) is a substantial improvement in the detection of accidents. We present a novel hybrid model integrating guidance classification with diffusion techniques. By leveraging fine-tuned ExceptionNet architecture outputs as input for our proposed diffusion model and processing image tensors as our conditioning, our approach creates a robust classification framework. Our model consists of multiple conditional modules, which aim to modulate the linear projection of inputs using time embeddings and image covariate embeddings, allowing the network to adapt its behavior dynamically throughout the diffusion process. To address the computationally intensive nature of diffusion models, our implementation is cloud-based, enabling scalable and efficient processing. Our strategy overcomes the shortcomings of conventional classification approaches by leveraging diffusion models inherent capacity to effectively understand complicated data distributions. We investigate important diffusion characteristics, such as timestep schedulers, timestep encoding techniques, timestep count, and architectural design changes, using a thorough ablation study, and have conducted a comprehensive evaluation of the proposed model against the baseline models on a publicly available dataset. The proposed diffusion model performs best in image-based accident detection with an accuracy of 97.32%.

[271] SAMSOD: Rethinking SAM Optimization for RGB-T Salient Object Detection

Zhengyi Liu, Xinrui Wang, Xianyong Fang, Zhengzheng Tu, Linbo Wang

Main category: cs.CV

TL;DR: SAMSOD improves RGB-T salient object detection by addressing modality imbalance and gradient conflicts through unimodal supervision, gradient deconfliction, and decoupled adapters for high/low-activation neurons.

DetailsMotivation: Existing methods using Segment Anything Model for RGB-T SOD ignore the imbalance convergence between two modalities and gradient differences between high- and low-activations, limiting performance.

Method: Proposes SAMSOD with unimodal supervision to enhance non-dominant modality learning, gradient deconfliction to reduce conflicting gradients, and two decoupled adapters to separately mask high- and low-activation neurons for better foreground emphasis.

Result: Demonstrates effectiveness through experiments on RGB-T SOD benchmark datasets, scribble supervised RGB-T SOD, fully supervised RGB-D SOD datasets, and RGB-D rail surface defect detection.

Conclusion: The proposed method effectively addresses modality imbalance and gradient conflicts, showing superior performance across multiple SOD tasks and datasets.

Abstract: RGB-T salient object detection (SOD) aims to segment attractive objects by combining RGB and thermal infrared images. To enhance performance, the Segment Anything Model has been fine-tuned for this task. However, the imbalance convergence of two modalities and significant gradient difference between high- and low- activations are ignored, thereby leaving room for further performance enhancement. In this paper, we propose a model called \textit{SAMSOD}, which utilizes unimodal supervision to enhance the learning of non-dominant modality and employs gradient deconfliction to reduce the impact of conflicting gradients on model convergence. The method also leverages two decoupled adapters to separately mask high- and low-activation neurons, emphasizing foreground objects by enhancing background learning. Fundamental experiments on RGB-T SOD benchmark datasets and generalizability experiments on scribble supervised RGB-T SOD, fully supervised RGB-D SOD datasets and full-supervised RGB-D rail surface defect detection all demonstrate the effectiveness of our proposed method.

[272] Referring Expression Comprehension for Small Objects

Kanoko Goto, Takumi Hirose, Mahiro Ukai, Shuhei Kurita, Nakamasa Inoue

Main category: cs.CV

TL;DR: The paper introduces SOREC dataset and PIZA method to improve referring expression comprehension for small objects in driving scenarios.

DetailsMotivation: Localizing extremely small objects in referring expression comprehension remains challenging despite advances in vision-language learning, especially for real-world applications like autonomous driving.

Method: Proposed progressive-iterative zooming adapter (PIZA) for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. Also created SOREC dataset with 100,000 expression-bounding box pairs for small objects in driving scenarios.

Result: Applied PIZA to GroundingDINO and demonstrated significant improvement in accuracy on the SOREC dataset.

Conclusion: The proposed dataset and method effectively address the challenge of small object localization in referring expression comprehension, with publicly available resources for further research.

Abstract: Referring expression comprehension (REC) aims to localize the target object described by a natural language expression. Recent advances in vision-language learning have led to significant performance improvements in REC tasks. However, localizing extremely small objects remains a considerable challenge despite its importance in real-world applications such as autonomous driving. To address this issue, we introduce a novel dataset and method for REC targeting small objects. First, we present the small object REC (SOREC) dataset, which consists of 100,000 pairs of referring expressions and corresponding bounding boxes for small objects in driving scenarios. Second, we propose the progressive-iterative zooming adapter (PIZA), an adapter module for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. In a series of experiments, we apply PIZA to GroundingDINO and demonstrate a significant improvement in accuracy on the SOREC dataset. Our dataset, codes and pre-trained models are publicly available on the project page.

[273] Artery-Vein Segmentation from Fundus Images using Deep Learning

Sharan SK, Subin Sahayam, Umarani Jayaraman, Lakshmi Priya A

Main category: cs.CV

TL;DR: Proposes Attention-WNet, a deep learning model with attention mechanism for retinal artery-vein segmentation, achieving state-of-the-art performance on HRF and DRIVE datasets.

DetailsMotivation: Retinal vessel analysis can provide biomarkers for diagnosing eye diseases and identifying patients at risk of systemic vasculature diseases like stroke and myocardial infarction.

Method: Incorporates attention mechanism into the WNet deep learning model to create Attention-WNet for artery-vein segmentation.

Result: Outperformed other state-of-the-art models on publicly available HRF and DRIVE datasets.

Conclusion: The proposed Attention-WNet approach demonstrates superior performance for retinal artery-vein segmentation compared to existing methods.

Abstract: Segmenting of clinically important retinal blood vessels into arteries and veins is a prerequisite for retinal vessel analysis. Such analysis can provide potential insights and bio-markers for identifying and diagnosing various retinal eye diseases. Alteration in the regularity and width of the retinal blood vessels can act as an indicator of the health of the vasculature system all over the body. It can help identify patients at high risk of developing vasculature diseases like stroke and myocardial infarction. Over the years, various Deep Learning architectures have been proposed to perform retinal vessel segmentation. Recently, attention mechanisms have been increasingly used in image segmentation tasks. The work proposes a new Deep Learning approach for artery-vein segmentation. The new approach is based on the Attention mechanism that is incorporated into the WNet Deep Learning model, and we call the model as Attention-WNet. The proposed approach has been tested on publicly available datasets such as HRF and DRIVE datasets. The proposed approach has outperformed other state-of-art models available in the literature.

[274] Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models

Leander Girrbach, Stephan Alaniz, Genevieve Smith, Trevor Darrell, Zeynep Akata

Main category: cs.CV

TL;DR: The paper creates demographic annotations for LAION-400M dataset and shows direct links between training data co-occurrences and downstream model biases in vision-language models.

DetailsMotivation: Vision-language models show strong demographic biases, but the role of training data in producing these biases remains unclear due to lack of demographic annotations in web-scale datasets like LAION-400M.

Method: Created person-centric annotations for LAION-400M using validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers, including bounding boxes, perceived gender and race/ethnicity labels, and auto-generated captions.

Result: Uncovered demographic imbalances and harmful associations (e.g., men and individuals perceived as Black or Middle Eastern disproportionately linked with crime-related content), and showed 60-70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data.

Conclusion: The resources establish the first large-scale empirical link between dataset composition and downstream model bias, providing evidence that training data imbalances directly contribute to model biases.

Abstract: Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that 60-70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias.

[275] Mapping Rio de Janeiro’s favelas: general-purpose vs. satellite-specific neural networks

Thomas Hallopeau, Joris Guérin, Laurent Demagistri, Youssef Fouzai, Renata Gracie, Vanderlei Pascoal De Matos, Helen Gurgel, Nadine Dessay

Main category: cs.CV

TL;DR: Comparison of pretrained neural networks for favela detection: generic networks trained on large diverse datasets vs specialized networks trained on satellite imagery.

DetailsMotivation: Deep learning methods for informal settlement detection haven't fully utilized recent pretrained neural networks, creating a need to determine whether task specificity or data volume yields better performance.

Method: Compare two types of pretrained neural networks: 1) Generic networks pretrained on large diverse datasets of unspecific images, 2) Specialized network pretrained on satellite imagery.

Result: The paper investigates the trade-off between task specificity (specialized satellite imagery network) and data volume (generic networks with more training images) for urban informal settlement detection.

Conclusion: Research aims to determine which approach - task-specific pretraining or large-scale generic pretraining - provides superior performance for detecting informal settlements like favelas in Rio de Janeiro.

Abstract: While deep learning methods for detecting informal settlements have already been developed, they have not yet fully utilized the potential offered by recent pretrained neural networks. We compare two types of pretrained neural networks for detecting the favelas of Rio de Janeiro: 1. Generic networks pretrained on large diverse datasets of unspecific images, 2. A specialized network pretrained on satellite imagery. While the latter is more specific to the target task, the former has been pretrained on significantly more images. Hence, this research investigates whether task specificity or data volume yields superior performance in urban informal settlement detection.

[276] LoRA Patching: Exposing the Fragility of Proactive Defenses against Deepfakes

Zuomin Qu, Yimao Guo, Qianyue Hu, Wei Lu

Main category: cs.CV

TL;DR: LoRA patching bypasses proactive Deepfake defenses by injecting plug-and-play patches into generators, using adaptive gating and multi-modal feature alignment to defeat state-of-the-art protections with minimal training.

DetailsMotivation: Current proactive Deepfake defenses that embed adversarial perturbations in facial images lack robustness and reliability, creating a security vulnerability that needs to be addressed.

Method: Proposes Low-Rank Adaptation (LoRA) patching with learnable gating mechanism to prevent gradient explosions, and Multi-Modal Feature Alignment (MMFA) loss for semantic-level feature alignment. Also introduces defensive LoRA patching as a complementary solution.

Result: Successfully defeats multiple proactive defenses with only 1,000 facial examples and a single epoch of fine-tuning, revealing critical weaknesses in current defense paradigms.

Conclusion: Current Deepfake defense strategies have fundamental vulnerabilities and require more robust approaches to address the security risks posed by techniques like LoRA patching.

Abstract: Deepfakes pose significant societal risks, motivating the development of proactive defenses that embed adversarial perturbations in facial images to prevent manipulation. However, in this paper, we show that these preemptive defenses often lack robustness and reliability. We propose a novel approach, Low-Rank Adaptation (LoRA) patching, which injects a plug-and-play LoRA patch into Deepfake generators to bypass state-of-the-art defenses. A learnable gating mechanism adaptively controls the effect of the LoRA patch and prevents gradient explosions during fine-tuning. We also introduce a Multi-Modal Feature Alignment (MMFA) loss, encouraging the features of adversarial outputs to align with those of the desired outputs at the semantic level. Beyond bypassing, we present defensive LoRA patching, embedding visible warnings in the outputs as a complementary solution to mitigate this newly identified security vulnerability. With only 1,000 facial examples and a single epoch of fine-tuning, LoRA patching successfully defeats multiple proactive defenses. These results reveal a critical weakness in current paradigms and underscore the need for more robust Deepfake defense strategies. Our code is available at https://github.com/ZOMIN28/LoRA-Patching.

[277] The Overlooked Value of Test-time Reference Sets in Visual Place Recognition

Mubariz Zaffar, Liangliang Nan, Sebastian Scherer, Julian F. P. Kooij

Main category: cs.CV

TL;DR: The paper proposes Reference-Set-Finetuning (RSF), a method to improve Visual Place Recognition (VPR) performance by fine-tuning models on the test-time reference set (map) to bridge the train-test domain gap.

DetailsMotivation: Some VPR benchmarks remain challenging when test environments differ significantly from training datasets, creating a domain gap that limits performance of current SOTA methods.

Method: Proposed Reference-Set-Finetuning (RSF) - fine-tuning VPR models on the test-time reference set (map) before receiving test queries, leveraging the available reference images and poses from the target domain.

Result: RSF boosts SOTA VPR performance by ~2.3% average increase in Recall@1 on challenging datasets, while maintaining generalization and working across diverse test datasets.

Conclusion: Reference-set finetuning is an effective complementary approach that leverages test-time reference data to improve VPR performance on challenging domain-shift scenarios.

Abstract: Given a query image, Visual Place Recognition (VPR) is the task of retrieving an image of the same place from a reference database with robustness to viewpoint and appearance changes. Recent works show that some VPR benchmarks are solved by methods using Vision-Foundation-Model backbones and trained on large-scale and diverse VPR-specific datasets. Several benchmarks remain challenging, particularly when the test environments differ significantly from the usual VPR training datasets. We propose a complementary, unexplored source of information to bridge the train-test domain gap, which can further improve the performance of State-of-the-Art (SOTA) VPR methods on such challenging benchmarks. Concretely, we identify that the test-time reference set, the “map”, contains images and poses of the target domain, and must be available before the test-time query is received in several VPR applications. Therefore, we propose to perform simple Reference-Set-Finetuning (RSF) of VPR models on the map, boosting the SOTA (~2.3% increase on average for Recall@1) on these challenging datasets. Finetuned models retain generalization, and RSF works across diverse test datasets.

[278] Adaptively Sampling-Reusing-Mixing Decomposed Gradients to Speed Up Sharpness Aware Minimization

Jiaxin Deng, Junbiao Pang

Main category: cs.CV

TL;DR: ARSAM accelerates Sharpness-Aware Minimization (SAM) by decomposing SAM’s gradient into SGD gradient and PSF (Projection of Second-order gradient onto First-order gradient), then adaptively reusing and mixing these components to reduce computational cost while maintaining generalization.

DetailsMotivation: SAM improves model generalization but doubles computational cost compared to SGD by requiring twice the gradient calculations. This motivates the development of a more efficient alternative.

Method: ARSAM decomposes SAM’s gradient into SGD gradient and PSF, observes their dynamic evolution during training, and adaptively reuses PSF while timely updating it to maintain performance while reducing computations.

Result: ARSAM achieves comparable accuracies to SAM with about 40% speedup on CIFAR-10/100. It also accelerates various challenging tasks like human pose estimation and model quantization without performance loss.

Conclusion: ARSAM provides an efficient alternative to SAM that maintains generalization performance while significantly reducing computational overhead, demonstrating broad practicality across diverse tasks.

Abstract: Sharpness-Aware Minimization (SAM) improves model generalization but doubles the computational cost of Stochastic Gradient Descent (SGD) by requiring twice the gradient calculations per optimization step. To mitigate this, we propose Adaptively sampling-Reusing-mixing decomposed gradients to significantly accelerate SAM (ARSAM). Concretely, we firstly discover that SAM’s gradient can be decomposed into the SGD gradient and the Projection of the Second-order gradient onto the First-order gradient (PSF). Furthermore, we observe that the SGD gradient and PSF dynamically evolve during training, emphasizing the growing role of the PSF to achieve a flat minima. Therefore, ARSAM is proposed to the reused PSF and the timely updated PSF still maintain the model’s generalization ability. Extensive experiments show that ARSAM achieves state-of-the-art accuracies comparable to SAM across diverse network architectures. On CIFAR-10/100, ARSAM is comparable to SAM while providing a speedup of about 40%. Moreover, ARSAM accelerates optimization for the various challenge tasks (\textit{e.g.}, human pose estimation, and model quantization) without sacrificing performance, demonstrating its broad practicality.% The code is publicly accessible at: https://github.com/ajiaaa/ARSAM.

[279] CoPA: Hierarchical Concept Prompting and Aggregating Network for Explainable Diagnosis

Yiheng Dong, Yi Lin, Xin Yang

Main category: cs.CV

TL;DR: CoPA is a framework that improves concept-based medical diagnosis by extracting multi-layer visual concepts under prompt guidance, enhancing both concept and disease prediction accuracy.

DetailsMotivation: Current concept-based methods for clinical diagnostics rely only on final layer features, neglecting shallow/multiscale features and lacking guidance for fine-grained concept extraction.

Method: Uses Concept-aware Embedding Generator to extract concept representations from each encoder layer, and Concept Prompt Tuning to amplify concept-related visual cues through prompt guidance.

Result: Outperforms state-of-the-art methods on three public datasets, effectively capturing and utilizing concept-wise information for improved predictions.

Conclusion: CoPA successfully addresses limitations of existing concept-based methods by leveraging multi-layer features and prompt guidance for enhanced clinical diagnostic transparency.

Abstract: The transparency of deep learning models is essential for clinical diagnostics. Concept Bottleneck Model provides clear decision-making processes for diagnosis by transforming the latent space of black-box models into human-understandable concepts. However, concept-based methods still face challenges in concept capture capabilities. These methods often rely on encode features solely from the final layer, neglecting shallow and multiscale features, and lack effective guidance in concept encoding, hindering fine-grained concept extraction. To address these issues, we introduce Concept Prompting and Aggregating (CoPA), a novel framework designed to capture multilayer concepts under prompt guidance. This framework utilizes the Concept-aware Embedding Generator (CEG) to extract concept representations from each layer of the visual encoder. Simultaneously, these representations serve as prompts for Concept Prompt Tuning (CPT), steering the model towards amplifying critical concept-related visual cues. Visual representations from each layer are aggregated to align with textual concept representations. With the proposed method, valuable concept-wise information in the images is captured and utilized effectively, thus improving the performance of concept and disease prediction. Extensive experimental results demonstrate that CoPA outperforms state-of-the-art methods on three public datasets. Code is available at https://github.com/yihengd/CoPA.

[280] Efficiency vs. Efficacy: Assessing the Compression Ratio-Dice Score Relationship through a Simple Benchmarking Framework for Cerebrovascular 3D Segmentation

Shimaa Elbana, Ahmad Kamal, Shahd Ahmed Ali, Ahmad Al-Kabbany

Main category: cs.CV

TL;DR: ZFP compression achieves up to 22.89:1 data reduction for 3D medical imaging while maintaining high cerebrovascular segmentation quality (Dice 0.87656 vs 0.8774 baseline).

DetailsMotivation: Address challenges of large 3D medical imaging datasets that hinder collaborative research and transferability.

Method: Apply ZFP compression in error tolerance and fixed-rate modes to 3D medical dataset with ground-truth vascular segmentations, comparing segmentation quality on compressed vs uncompressed volumes.

Result: ZFP achieves substantial data reduction (up to 22.89:1 ratio) while maintaining high fidelity with mean Dice coefficient of 0.87656 compared to baseline 0.8774.

Conclusion: ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration.

Abstract: The increasing size and complexity of medical imaging datasets, particularly in 3D formats, present significant barriers to collaborative research and transferability. This study investigates whether the ZFP compression technique can mitigate these challenges without compromising the performance of automated cerebrovascular segmentation, a critical first step in intracranial aneurysm detection. We apply ZFP in both its error tolerance and fixed-rate modes to a large scale, and one of the most recent, datasets in the literature, 3D medical dataset containing ground-truth vascular segmentations. The segmentation quality on the compressed volumes is rigorously compared to the uncompressed baseline (Dice approximately equals 0.8774). Our findings reveal that ZFP can achieve substantial data reduction–up to a 22.89:1 ratio in error tolerance mode–while maintaining a high degree of fidelity, with the mean Dice coefficient remaining high at 0.87656. These results demonstrate that ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration across the community.

[281] MambaCAFU: Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation

T-Mai Bui, Fares Bougourzi, Fadi Dornaika, Vinh Truong Hoang

Main category: cs.CV

TL;DR: A hybrid medical image segmentation architecture combining CNNs, Transformers, and Mamba-based attention to capture local, global, and long-range dependencies, achieving state-of-the-art performance with balanced efficiency.

DetailsMotivation: Existing deep learning models for medical segmentation are often task-specific with varying performance across modalities and anatomical regions, while balancing model complexity and performance remains challenging in clinical settings.

Method: Three-branch encoder integrating CNNs, Transformers, and Mamba-based Attention Fusion (MAF) mechanism, with multi-scale attention-based CNN decoder and co-attention gate for enhanced feature selection and cross-scale communication.

Result: Outperforms state-of-the-art methods in accuracy and generalization on multiple benchmark datasets while maintaining comparable computational complexity.

Conclusion: The architecture effectively balances efficiency and effectiveness, offering a practical and scalable solution for diverse medical imaging tasks.

Abstract: In recent years, deep learning has shown near-expert performance in segmenting complex medical tissues and tumors. However, existing models are often task-specific, with performance varying across modalities and anatomical regions. Balancing model complexity and performance remains challenging, particularly in clinical settings where both accuracy and efficiency are critical. To address these issues, we propose a hybrid segmentation architecture featuring a three-branch encoder that integrates CNNs, Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture local, global, and long-range dependencies. A multi-scale attention-based CNN decoder reconstructs fine-grained segmentation maps while preserving contextual consistency. Additionally, a co-attention gate enhances feature selection by emphasizing relevant spatial and semantic information across scales during both encoding and decoding, improving feature interaction and cross-scale communication. Extensive experiments on multiple benchmark datasets show that our approach outperforms state-of-the-art methods in accuracy and generalization, while maintaining comparable computational complexity. By effectively balancing efficiency and effectiveness, our architecture offers a practical and scalable solution for diverse medical imaging tasks. Source code and trained models will be publicly released upon acceptance to support reproducibility and further research.

[282] Road Damage and Manhole Detection using Deep Learning for Smart Cities: A Polygonal Annotation Approach

Rasel Hossen, Diptajoy Mistry, Mushiur Rahman, Waki As Sami Atikur Rahman Hridoy, Sajib Saha, Muhammad Ibrahim

Main category: cs.CV

TL;DR: This paper presents a YOLOv9-based deep learning system for automated road damage and manhole detection using polygonal annotations, achieving 78.1% overall accuracy with strong performance on road damage classes but poor manhole detection due to class imbalance.

DetailsMotivation: Manual monitoring of road damages is time-consuming, costly, and error-prone, especially in developing countries where urban infrastructure maintenance is critical for smart city development.

Method: Uses YOLOv9 algorithm with polygonal annotations (instead of traditional bounding boxes) for precise localization, trained on a novel dataset of 1000+ images from Dhaka, Bangladesh for three classes: Broken, Not Broken, and Manhole.

Result: Achieved 78.1% overall image-level accuracy with strong F1-scores for Broken (86.7%) and Not Broken (89.2%) classes, but poor performance for Manhole detection (18.2% F1-score) due to class imbalance.

Conclusion: The approach provides an efficient and scalable solution for urban infrastructure monitoring in developing countries, though manhole detection needs improvement through better class balancing.

Abstract: Urban safety and infrastructure maintenance are critical components of smart city development. Manual monitoring of road damages is time-consuming, highly costly, and error-prone. This paper presents a deep learning approach for automated road damage and manhole detection using the YOLOv9 algorithm with polygonal annotations. Unlike traditional bounding box annotation, we employ polygonal annotations for more precise localization of road defects. We develop a novel dataset comprising more than one thousand images which are mostly collected from Dhaka, Bangladesh. This dataset is used to train a YOLO-based model for three classes, namely Broken, Not Broken, and Manhole. We achieve 78.1% overall image-level accuracy. The YOLOv9 model demonstrates strong performance for Broken (86.7% F1-score) and Not Broken (89.2% F1-score) classes, with challenges in Manhole detection (18.2% F1-score) due to class imbalance. Our approach offers an efficient and scalable solution for monitoring urban infrastructure in developing countries.

[283] Contrastive-SDE: Guiding Stochastic Differential Equations with Contrastive Learning for Unpaired Image-to-Image Translation

Venkata Narendra Kotyada, Revanth Eranki, Nagesh Bhattu Sristy

Main category: cs.CV

TL;DR: Proposes Contrastive-SDE, a method combining contrastive learning with score-based diffusion models for unpaired image-to-image translation, achieving comparable performance to state-of-the-art with faster convergence and no label supervision.

DetailsMotivation: Unpaired image-to-image translation lacks aligned samples, while diffusion models excel at approximating complex distributions and contrastive learning effectively learns semantic similarities without supervision - combining these strengths addresses the unpaired I2I challenge.

Method: Uses time-dependent contrastive learning with SimCLR, treating an image and its domain-invariant feature as positive pairs. The learned contrastive model then guides inference of a pretrained stochastic differential equation (SDE) for translation.

Result: Achieves comparable results to state-of-the-art on several metrics across three unpaired I2I tasks. Model converges significantly faster and requires no label supervision or classifier training.

Conclusion: Contrastive-SDE provides an efficient alternative for unpaired image-to-image translation, combining the strengths of contrastive learning and diffusion models while eliminating the need for supervision and reducing training time.

Abstract: Unpaired image-to-image translation involves learning mappings between source domain and target domain in the absence of aligned or corresponding samples. Score based diffusion models have demonstrated state-of-the-art performance in generative tasks. Their ability to approximate complex data distributions through stochastic differential equations (SDEs) enables them to generate high-fidelity and diverse outputs, making them particularly well-suited for unpaired I2I settings. In parallel, contrastive learning provides a powerful framework for learning semantic similarities without the need for explicit supervision or paired data. By pulling together representations of semantically similar samples and pushing apart dissimilar ones, contrastive methods are inherently aligned with the objectives of unpaired translation. Its ability to selectively enforce semantic consistency at the feature level makes contrastive learning particularly effective for guiding generation in unpaired scenarios. In this work, we propose a time-dependent contrastive learning approach where a model is trained with SimCLR by considering an image and its domain invarient feature as a positive pair, enabling the preservation of domain-invariant features and the discarding of domain-specific ones. The learned contrastive model then guides the inference of a pretrained SDE for the I2I translation task. We empirically compare Contrastive-SDE with several baselines across three common unpaired I2I tasks, using four metrics for evaluation. Constrastive-SDE achieves comparable results to the state-of-the-art on several metrics. Furthermore, we observe that our model converges significantly faster and requires no label supervision or classifier training, making it a more efficient alternative for this task.

[284] LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun

Main category: cs.CV

TL;DR: LIBERO-PRO extends the LIBERO benchmark to evaluate VLA models under realistic perturbations, revealing that models achieving 90%+ accuracy in standard settings collapse to 0% when tested on object variations, initial states, instruction changes, and environmental modifications.

DetailsMotivation: Current LIBERO benchmark settings lead to inflated performance estimates and prevent fair model comparison due to models' reliance on memorization rather than genuine understanding.

Method: Extended LIBERO benchmark with systematic perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments to test model robustness.

Result: Existing models that achieve over 90% accuracy in standard LIBERO evaluation collapse to 0.0% accuracy under the generalized LIBERO-PRO setting, exposing reliance on rote memorization.

Conclusion: Current evaluation practices are severely flawed, and the community should adopt robust assessments that test genuine model generalization and comprehension rather than memorization capabilities.

Abstract: LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models’ reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.

[285] Mirage: Unveiling Hidden Artifacts in Synthetic Images with Large Vision-Language Models

Pranav Sharma, Shivank Garg, Durga Toshniwal

Main category: cs.CV

TL;DR: The paper introduces Mirage dataset of AI-generated images with visible artifacts where current detectors fail, and investigates using Large Vision-Language Models for explainable AI image detection.

DetailsMotivation: AI-generated images are becoming increasingly difficult for standard detectors to identify, though humans can still distinguish them. There's a need for better detection methods that can explain their decisions.

Method: Created Mirage dataset with diverse AI-generated images showing visible artifacts. Evaluated Large Vision-Language Models (LVLMs) for explainable AI image detection on both Mirage and existing benchmark datasets.

Result: LVLMs are highly effective at detecting AI-generated images with visible artifacts, but their performance declines when images lack such visual cues.

Conclusion: LVLMs show promise for explainable AI image detection, particularly for images with visible artifacts, but struggle with more sophisticated AI-generated content lacking obvious visual cues.

Abstract: Recent advances in image generation models have led to models that produce synthetic images that are increasingly difficult for standard AI detectors to identify, even though they often remain distinguishable by humans. To identify this discrepancy, we introduce \textbf{Mirage}, a curated dataset comprising a diverse range of AI-generated images exhibiting visible artifacts, where current state-of-the-art detection methods largely fail. Furthermore, we investigate whether Large Vision-Language Models (LVLMs), which are increasingly employed as substitutes for human judgment in various tasks, can be leveraged for explainable AI image detection. Our experiments on both Mirage and existing benchmark datasets demonstrate that while LVLMs are highly effective at detecting AI-generated images with visible artifacts, their performance declines when confronted with images lacking such cues.

[286] UGround: Towards Unified Visual Grounding with Unrolled Transformers

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou

Main category: cs.CV

TL;DR: UGround introduces a unified visual grounding paradigm that dynamically selects intermediate transformer layers as “mask as prompt” instead of using fixed last hidden layers, addressing error propagation and spatial cue limitations in existing methods.

DetailsMotivation: Current visual grounding methods rely on fixed last hidden layers which amplify cumulative errors through layer-by-layer propagation without correction, and use tokens that implicitly project text to visual space without explicit spatial cues.

Method: Proposes Policy-Prompted Masking with two components: Stochastic Skip Connection (SSC) - a reinforcement learning policy that dynamically selects layers for tokens to connect to vision models, and Mask as Prompt (MasP) - uses similarity maps as soft logit masks to prompt SAM with explicit spatial cues.

Result: UGround unifies visual grounding within a single framework spanning traditional refer expression segmentation to reasoning segmentation, single-target to multi-target, and positive query to false premise scenarios.

Conclusion: UGround provides a more effective approach to visual grounding by dynamically selecting intermediate layers and using explicit spatial cues, addressing key limitations of existing methods while unifying various grounding tasks.

Abstract: We present UGround, a \textbf{U}nified visual \textbf{Ground}ing paradigm that dynamically selects intermediate layers across \textbf{U}nrolled transformers as mask as prompt'', diverging from the prevailing pipeline that leverages the fixed last hidden layer as \texttt{} as prompt’’. UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of \texttt{} as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (\eg, coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each \texttt{} token to slide across unrolled transformer layers, enabling dynamic layer selection at which it connects to the vision model (\eg, SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the \texttt{} token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, have unified visual grounding within a single framework from an attribute perspective, spanning from traditional refer expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, positive query to false premise (empty target). All codes and models are publicly available at \href{https://github.com/rui-qian/UGround}{https://github.com/rui-qian/UGround}.

[287] Optimized Minimal 4D Gaussian Splatting

Minseo Lee, Byeonghyeon Lee, Lucas Yunkyu Lee, Eunsoo Lee, Sangmin Kim, Seunghyeon Song, Joo Chan Lee, Jong Hwan Ko, Jaesik Park, Eunbyung Park

Main category: cs.CV

TL;DR: OMG4 is a framework that reduces storage overhead in 4D Gaussian Splatting by progressively pruning and merging Gaussians while maintaining reconstruction quality.

DetailsMotivation: 4D Gaussian Splatting faces major storage challenges requiring millions of Gaussians for high-fidelity reconstruction, with existing methods having limitations in compression ratio or visual quality.

Method: Three-stage progressive pruning: (1) Gaussian Sampling to identify critical primitives, (2) Gaussian Pruning to remove redundancies, (3) Gaussian Merging to fuse similar primitives. Also integrates implicit appearance compression and Sub-Vector Quantization for 4D representations.

Result: Reduces model sizes by over 60% while maintaining reconstruction quality, significantly outperforming state-of-the-art methods on standard benchmark datasets.

Conclusion: OMG4 represents a significant advancement in compact 4D scene representation, enabling new applications by dramatically reducing storage requirements without sacrificing quality.

Abstract: 4D Gaussian Splatting has emerged as a new paradigm for dynamic scene representation, enabling real-time rendering of scenes with complex motions. However, it faces a major challenge of storage overhead, as millions of Gaussians are required for high-fidelity reconstruction. While several studies have attempted to alleviate this memory burden, they still face limitations in compression ratio or visual quality. In this work, we present OMG4 (Optimized Minimal 4D Gaussian Splatting), a framework that constructs a compact set of salient Gaussians capable of faithfully representing 4D Gaussian models. Our method progressively prunes Gaussians in three stages: (1) Gaussian Sampling to identify primitives critical to reconstruction fidelity, (2) Gaussian Pruning to remove redundancies, and (3) Gaussian Merging to fuse primitives with similar characteristics. In addition, we integrate implicit appearance compression and generalize Sub-Vector Quantization (SVQ) to 4D representations, further reducing storage while preserving quality. Extensive experiments on standard benchmark datasets demonstrate that OMG4 significantly outperforms recent state-of-the-art methods, reducing model sizes by over 60% while maintaining reconstruction quality. These results position OMG4 as a significant step forward in compact 4D scene representation, opening new possibilities for a wide range of applications. Our source code is available at https://minshirley.github.io/OMG4/.

[288] Cross-View Open-Vocabulary Object Detection in Aerial Imagery

Jyoti Kini, Rohit Gupta, Mubarak Shah

Main category: cs.CV

TL;DR: A novel framework for adapting open-vocabulary object detection from ground-view to aerial imagery through structured domain alignment, achieving significant performance improvements in zero-shot settings.

DetailsMotivation: Traditional object detection models are limited to fixed classes, making it costly to add new categories. Open-vocabulary detection enables identifying unseen classes without retraining, but domain shifts between ground-view and aerial imagery require specialized adaptation.

Method: Uses contrastive image-to-image alignment to enhance similarity between aerial and ground-view embeddings, and multi-instance vocabulary associations to align aerial images with text embeddings from pretrained models.

Result: Achieved improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone, and +3.46 mAP on HRRSD in zero-shot setting compared to finetuned closed-vocabulary models.

Conclusion: The framework enables flexible and scalable object detection in aerial applications by effectively transferring open-vocabulary representations across domains.

Abstract: Traditional object detection models are typically trained on a fixed set of classes, limiting their flexibility and making it costly to incorporate new categories. Open-vocabulary object detection addresses this limitation by enabling models to identify unseen classes without explicit training. Leveraging pretrained models contrastively trained on abundantly available ground-view image-text classification pairs provides a strong foundation for open-vocabulary object detection in aerial imagery. Domain shifts, viewpoint variations, and extreme scale differences make direct knowledge transfer across domains ineffective, requiring specialized adaptation strategies. In this paper, we propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery through structured domain alignment. The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings and employs multi-instance vocabulary associations to align aerial images with text embeddings. Extensive experiments on the xView, DOTAv2, VisDrone, DIOR, and HRRSD datasets are used to validate our approach. Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting when compared to finetuned closed-vocabulary dataset-specific model performance, thus paving the way for more flexible and scalable object detection systems in aerial applications.

[289] Exploring the Challenge and Value of Deep Learning in Automated Skin Disease Diagnosis

Runhao Liu, Ziming Chen, Peng Zhang

Main category: cs.CV

TL;DR: This review paper analyzes deep learning approaches for skin cancer diagnosis, discussing challenges like complex features, image noise, and data imbalance, and explores solutions including data augmentation, hybrid models, and feature fusion.

DetailsMotivation: Skin cancer is a prevalent and deadly disease where early detection is crucial. Deep learning shows promise for automated diagnosis but faces challenges that need to be addressed.

Method: The review follows the PRISMA framework methodology and synthesizes recent research on DL approaches for skin cancer diagnosis.

Result: The paper identifies innovative approaches to overcome DL challenges in skin cancer diagnosis, including data augmentation, hybrid models, and feature fusion techniques.

Conclusion: Deep learning has transformative potential for skin disease diagnosis and clinical decision-making, but continued advancements are needed to fully realize this potential in dermatological care.

Abstract: Skin cancer is one of the most prevalent and deadly forms of cancer worldwide, which highlights the critical importance of early detection and diagnosis in improving patient outcomes. Deep learning (DL) has shown significant promise in enhancing the accuracy and efficiency of automated skin disease diagnosis, particularly in detecting and evaluating skin lesions and classification. However, there are still several challenges for DL-based skin cancer diagnosis, including complex features, image noise, intra-class variation, inter-class similarity, and data imbalance. By synthesizing recent research, this review discusses innovative approaches to cope with these challenges, such as data augmentation, hybrid models, and feature fusion, etc. Furthermore, the review highlights the integration of DL models into clinical workflows, offering insights into the potential of deep learning to revolutionize skin disease diagnosis and improve clinical decision-making. This article follows a comprehensive methodology based on the PRISMA framework and emphasizes the need for continued advancements to fully unlock the transformative potential of DL in dermatological care.

[290] SDAKD: Student Discriminator Assisted Knowledge Distillation for Super-Resolution Generative Adversarial Networks

Nikolaos Kaparinos, Vasileios Mezaris

Main category: cs.CV

TL;DR: SDAKD is a novel GAN distillation method that introduces a student discriminator to address capacity mismatch issues, achieving improved performance in image super-resolution tasks.

DetailsMotivation: GANs have high computational requirements that limit deployment on resource-constrained devices, and existing knowledge distillation methods struggle with capacity mismatch between student generators and teacher discriminators.

Method: Proposes Student Discriminator Assisted Knowledge Distillation (SDAKD) with a three-stage training strategy and adapted feature map distillation in the last two stages.

Result: Experiments on GCFSR and Real-ESRGAN show consistent improvements over baselines and state-of-the-art GAN knowledge distillation methods.

Conclusion: SDAKD effectively addresses the capacity mismatch problem in GAN distillation and enables more efficient deployment of GANs on resource-constrained devices.

Abstract: Generative Adversarial Networks (GANs) achieve excellent performance in generative tasks, such as image super-resolution, but their computational requirements make difficult their deployment on resource-constrained devices. While knowledge distillation is a promising research direction for GAN compression, effectively training a smaller student generator is challenging due to the capacity mismatch between the student generator and the teacher discriminator. In this work, we propose Student Discriminator Assisted Knowledge Distillation (SDAKD), a novel GAN distillation methodology that introduces a student discriminator to mitigate this capacity mismatch. SDAKD follows a three-stage training strategy, and integrates an adapted feature map distillation approach in its last two training stages. We evaluated SDAKD on two well-performing super-resolution GANs, GCFSR and Real-ESRGAN. Our experiments demonstrate consistent improvements over the baselines and SOTA GAN knowledge distillation methods. The SDAKD source code will be made openly available upon acceptance of the paper.

[291] PoseGaze-AHP: A Knowledge-Based 3D Dataset for AI-Driven Ocular and Postural Diagnosis

Saja Al-Dabet, Sherzod Turaev, Nazar Zaki, Arif O. Khan, Luai Eldweik

Main category: cs.CV

TL;DR: PoseGaze-AHP is a novel 3D dataset that synchronously captures head pose and gaze movement information for ocular-induced abnormal head posture assessment, created using LLM-extracted clinical data and Neural Head Avatar framework.

DetailsMotivation: Existing datasets focus on head pose and ocular movements separately, limiting integrated diagnostic approaches and AI advancements in abnormal head posture analysis.

Method: Clinical data extracted from medical literature using LLMs (Claude 3.5 Sonnet) with iterative prompting strategies, then transformed into 3D representations using Neural Head Avatar framework.

Result: Created dataset with 7,920 images from two head textures covering broad ocular conditions, with extraction method achieving 91.92% accuracy.

Conclusion: PoseGaze-AHP is the first publicly available resource for AI-driven ocular-induced AHP diagnosis, supporting development of accurate and privacy-compliant diagnostic tools.

Abstract: Diagnosing ocular-induced abnormal head posture (AHP) requires a comprehensive analysis of both head pose and ocular movements. However, existing datasets focus on these aspects separately, limiting the development of integrated diagnostic approaches and restricting AI-driven advancements in AHP analysis. To address this gap, we introduce PoseGaze-AHP, a novel 3D dataset that synchronously captures head pose and gaze movement information for ocular-induced AHP assessment. Structured clinical data were extracted from medical literature using large language models (LLMs) through an iterative process with the Claude 3.5 Sonnet model, combining stepwise, hierarchical, and complex prompting strategies. The extracted records were systematically imputed and transformed into 3D representations using the Neural Head Avatar (NHA) framework. The dataset includes 7,920 images generated from two head textures, covering a broad spectrum of ocular conditions. The extraction method achieved an overall accuracy of 91.92%, demonstrating its reliability for clinical dataset construction. PoseGaze-AHP is the first publicly available resource tailored for AI-driven ocular-induced AHP diagnosis, supporting the development of accurate and privacy-compliant diagnostic tools.

[292] DHQA-4D: Perceptual Quality Assessment of Dynamic 4D Digital Human

Yunhao Li, Sijing Wu, Yucheng Zhu, Huiyu Duan, Zicheng Zhang, Guangtao Zhai

Main category: cs.CV

TL;DR: This paper introduces DHQA-4D, a large-scale dataset for quality assessment of dynamic 4D digital humans, and proposes DynaMesh-Rater, a novel LMM-based method that extracts multi-dimensional features (visual, motion, geometry) to predict quality scores for both textured and non-textured 4D meshes.

DetailsMotivation: Dynamic 4D human avatars are prone to noise degradation during collection, compression, and transmission, which affects user viewing experience. Quality assessment of these digital humans is becoming increasingly important for applications in gaming, animation, and remote communication.

Method: Proposed DynaMesh-Rater: a large multimodal model (LMM) approach that extracts multi-dimensional features including visual features from projected 2D video, motion features from cropped video clips, and geometry features from 4D human mesh. Uses LoRA-based instruction tuning to teach the LMM to predict quality scores.

Result: Extensive experiments on the DHQA-4D dataset (containing 32 high-quality 4D sequences and 1920 distorted meshes) demonstrate the superiority of DynaMesh-Rater over previous quality assessment methods for both textured and non-textured 4D meshes.

Conclusion: The proposed DHQA-4D dataset and DynaMesh-Rater method effectively address the quality assessment challenge for dynamic 4D digital humans, providing comprehensive multi-dimensional feature analysis and superior performance compared to existing methods.

Abstract: With the rapid development of 3D scanning and reconstruction technologies, dynamic digital human avatars based on 4D meshes have become increasingly popular. A high-precision dynamic digital human avatar can be applied to various fields such as game production, animation generation, and remote immersive communication. However, these 4D human avatar meshes are prone to being degraded by various types of noise during the processes of collection, compression, and transmission, thereby affecting the viewing experience of users. In light of this fact, quality assessment of dynamic 4D digital humans becomes increasingly important. In this paper, we first propose a large-scale dynamic digital human quality assessment dataset, DHQA-4D, which contains 32 high-quality real-scanned 4D human mesh sequences, 1920 distorted textured 4D human meshes degraded by 11 textured distortions, as well as their corresponding textured and non-textured mean opinion scores (MOSs). Equipped with DHQA-4D dataset, we analyze the influence of different types of distortion on human perception for textured dynamic 4D meshes and non-textured dynamic 4D meshes. Additionally, we propose DynaMesh-Rater, a novel large multimodal model (LMM) based approach that is able to assess both textured 4D meshes and non-textured 4D meshes. Concretely, DynaMesh-Rater elaborately extracts multi-dimensional features, including visual features from a projected 2D video, motion features from cropped video clips, and geometry features from the 4D human mesh to provide comprehensive quality-related information. Then we utilize a LMM model to integrate the multi-dimensional features and conduct a LoRA-based instruction tuning technique to teach the LMM model to predict the quality scores. Extensive experimental results on the DHQA-4D dataset demonstrate the superiority of our DynaMesh-Rater method over previous quality assessment methods.

[293] Skin Lesion Classification Based on ResNet-50 Enhanced With Adaptive Spatial Feature Fusion

Runhao Liu, Ziming Chen, Peng Zhang

Main category: cs.CV

TL;DR: Improved ResNet-50 with Adaptive Spatial Feature Fusion (ASFF) achieves 93.18% accuracy for skin cancer classification by adaptively fusing multi-scale semantic and detail features to handle inter-class similarity and image noise.

DetailsMotivation: Skin cancer classification faces challenges due to high inter-class similarity, intra-class variability, and image noise in dermoscopic images, requiring better feature representation and reduced overfitting.

Method: Enhanced ResNet-50 with ASFF using dual-branch design that fuses high-level semantic and mid-level detail features through global average pooling and fully connected layers to generate adaptive weights for weighted fusion.

Result: Achieved 93.18% accuracy on ISIC 2020 dataset (3297 images), outperforming 5 classic CNNs with higher precision, recall, specificity, F1 score, and AUC values of 0.9670 (P-R) and 0.9717 (ROC).

Conclusion: The proposed ASFF-based ResNet-50 provides a more effective and efficient solution for computer-aided skin cancer diagnosis by adaptively focusing on lesion-relevant regions while suppressing background noise.

Abstract: Skin cancer classification remains a challenging problem due to high inter-class similarity, intra-class variability, and image noise in dermoscopic images. To address these issues, we propose an improved ResNet-50 model enhanced with Adaptive Spatial Feature Fusion (ASFF), which adaptively integrates multi-scale semantic and surface features to improve feature representation and reduce overfitting. The ResNet-50 model is enhanced with an adaptive feature fusion mechanism to achieve more effective multi-scale feature extraction and improve overall performance. Specifically, a dual-branch design fuses high-level semantic and mid-level detail features, which are processed through global average pooling and fully connected layers to generate adaptive weights for weighted fusion, thereby strengthening feature learning and reducing the impact of noise on classification. The method is evaluated on a subset of the ISIC 2020 dataset containing 3297 benign and malignant skin lesion images. Experimental results show that the proposed ASFF-based ResNet-50 achieves the best overall performance compared with 5 classic convolutional neural networks (CNNs) models. The proposed model reached an accuracy of 93.18% along with higher precision, recall, specificity, and F1 score. The improved model achieves an AUC value of 0.9670 and 0.9717 in the P-R and ROC curve, respectively. Then, the evaluation based on Grad-CAM further proved that the improved model adaptively focuses on lesion-relevant regions while suppressing irrelevant background information, thereby validating its enhanced feature learning capability from a deep representation perspective. These findings demonstrate that the proposed approach provides a more effective and efficient solution for computer-aided skin cancer diagnosis.

[294] Multi-Modal Oral Cancer Detection Using Weighted Ensemble Convolutional Neural Networks

Ajo Babu George, Sreehari J R Ajo Babu George, Sreehari J R Ajo Babu George, Sreehari J R

Main category: cs.CV

TL;DR: Multimodal deep learning framework using DenseNet-121 CNNs improves early detection of Oral Squamous Cell Carcinoma by integrating clinical, radiological, and histopathological images through weighted ensemble fusion.

DetailsMotivation: Late diagnosis of OSCC contributes to high mortality rates, with over 50% of cases detected at advanced stages and poor 5-year survival rates below 50%, highlighting the need for improved early detection methods.

Method: Retrospective study using publicly available datasets across three medical imaging modalities. Each modality trained a DenseNet-121 CNN via transfer learning with augmentation and modality-specific preprocessing. Predictions were fused using validation-weighted ensemble strategy.

Result: High validation accuracy for radiological (100%) and histopathological (95.12%) modalities, with clinical images performing lower (63.10%) due to visual heterogeneity. Ensemble model achieved 84.58% overall accuracy on multimodal validation dataset of 55 samples.

Conclusion: The multimodal ensemble framework provides a non-invasive, AI-assisted triage tool that enhances early identification of high-risk lesions, supports clinical decision-making, and aligns with global oncology guidelines to reduce diagnostic delays and improve patient outcomes.

Abstract: Aims Late diagnosis of Oral Squamous Cell Carcinoma (OSCC) contributes significantly to its high global mortality rate, with over 50% of cases detected at advanced stages and a 5-year survival rate below 50% according to WHO statistics. This study aims to improve early detection of OSCC by developing a multimodal deep learning framework that integrates clinical, radiological, and histopathological images using a weighted ensemble of DenseNet-121 convolutional neural networks (CNNs). Material and Methods A retrospective study was conducted using publicly available datasets representing three distinct medical imaging modalities. Each modality-specific dataset was used to train a DenseNet-121 CNN via transfer learning. Augmentation and modality-specific preprocessing were applied to increase robustness. Predictions were fused using a validation-weighted ensemble strategy. Evaluation was performed using accuracy, precision, recall, F1-score. Results High validation accuracy was achieved for radiological (100%) and histopathological (95.12%) modalities, with clinical images performing lower (63.10%) due to visual heterogeneity. The ensemble model demonstrated improved diagnostic robustness with an overall accuracy of 84.58% on a multimodal validation dataset of 55 samples. Conclusion The multimodal ensemble framework bridges gaps in the current diagnostic workflow by offering a non-invasive, AI-assisted triage tool that enhances early identification of high-risk lesions. It supports clinicians in decision-making, aligning with global oncology guidelines to reduce diagnostic delays and improve patient outcomes.

[295] Exploring Instruction Data Quality for Explainable Image Quality Assessment

Yunhao Li, Sijing Wu, Huiyu Duan, Yucheng Zhu, Qi Jia, Guangtao Zhai

Main category: cs.CV

TL;DR: The paper challenges the scaling law in explainable image quality assessment by showing that data quality matters more than quantity. It proposes IQA-Select, a clustering-based method that achieves better performance using only 10% of data.

DetailsMotivation: Current methods rely on large-scale instruction tuning datasets for MLLMs in explainable IQA, but this causes high computational costs and redundant data that can harm model performance. The authors aim to challenge the scaling law paradigm.

Method: Proposed IQA-Select, a clustering-based data selection framework with three stages: clustering feature extraction, cluster quota allocation, and cluster sampling strategy. The method systematically selects high-quality subsets from instruction tuning data.

Result: IQA-Select achieved 102.1% and 103.7% performance of full fine-tuning using only 10% selected data in Q-Bench and AesBench respectively, demonstrating superior efficiency and performance.

Conclusion: Data quality is more important than quantity for explainable IQA. The proposed IQA-Select method effectively reduces computational costs while improving performance by selecting high-quality data subsets.

Abstract: In recent years, with the rapid development of powerful multimodal large language models (MLLMs), explainable image quality assessment (IQA) has gradually become popular, aiming at providing quality-related descriptions and answers of images. To achieve this goal, recent methods seek to construct a large-scale instruction tuning dataset to empower the MLLM with quality perception ability following the well-known scaling law. However, a large amount of instruction tuning data may cause substantial computational costs and redundant data, which in turn will cause harm to the performance of the model. To cope with this problem, in this paper, we challenge the scaling law and systematically investigate the role of data quality of the instruction tuning dataset for explainable IQA. Using a powerful pre-trained MLLM, we first investigate the changes in model performance after fine-tuning with different sizes of instruction tuning data. We find that selecting a subset of the data set randomly using an appropriate ratio can even lead to better results than training with the entire instruction tuning dataset, demonstrating the redundancy of current explainable IQA instruction tuning data. Beyond randomly sampling a subset, we propose a clustering-based data selection framework with three stages: clustering feature extraction, cluster quota allocation, and cluster sampling strategy. Then we systematically analyze the choices of each stage and propose a simple but efficient data selection method IQA-Select for explainable IQA. The experimental results demonstrate that IQA-Select can achieve 102.1% and 103.7% performance of full fine-tuning using only 10% selected data in Q-Bench and AesBench respectively, significantly reducing computational costs while achieving better performance.

[296] Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

Main category: cs.CV

TL;DR: A new Vision-Language-Action framework that uses sparse 3D trajectories as an intermediate representation to bridge high-level planning with low-level physical actions, enabling better generalization without requiring fine-tuning for new environments.

DetailsMotivation: Current VLA models generalize poorly due to monolithic architectures constrained by scarce data, while dual-system approaches suffer from semantic ambiguities in action modules that make cross-task training infeasible.

Method: Uses sparse 3D waypoints generated by VLM as intermediate representation, processed by a generalizable action expert that refines them into dense action sequences using real-time point cloud observations. Implements “Action Pre-training, Pointcloud Fine-tuning” paradigm.

Result: The framework effectively bridges VLM planning capabilities with physical action execution, enabling robust generalization across tasks and environments without requiring fine-tuning on new data.

Conclusion: The approach successfully combines the broad generalization of VLMs with fine-grained action-level generalization, overcoming limitations of previous VLA models and dual-system approaches.

Abstract: Although Vision-Language Models (VLM) have demonstrated impressive planning and reasoning capabilities, translating these abilities into the physical world introduces significant challenges. Conventional Vision-Language-Action (VLA) models, which integrate reasoning and action into a monolithic architecture, generalize poorly because they are constrained by scarce, narrow-domain data. While recent dual-system approaches attempt to decouple “thinking” from “acting”, they are often constrained by semantic ambiguities within the action module. This ambiguity makes large-scale, cross-task training infeasible. Consequently, these systems typically necessitate fine-tuning on newly collected data when deployed to novel environments, and the cooperation mechanism between the two systems remains ill-defined. To address these limitations, we introduce, for the first time, a framework centered around a generalizable action expert. Our approach utilizes sparse 3D trajectories as an intermediate representation, effectively bridging the high-level planning capabilities of the VLM with the low-level physical action module. During the planning phase, the VLM is only required to generate coarse 3D waypoints. These waypoints are then processed by our generalizable action expert, which refines them into dense, executable action sequences by sampling real-time point cloud observations of the environment. To promote training efficiency and robust generalization, we introduce a novel “Action Pre-training, Pointcloud Fine-tuning” paradigm. Our method combines the broad generalization capabilities of VLMs in visual understanding and planning with the fine-grained, action-level generalization of action expert.

[297] Zero-Shot Fine-Grained Image Classification Using Large Vision-Language Models

Md. Atabuzzaman, Andrew Zhang, Chris Thomas

Main category: cs.CV

TL;DR: A novel method transforms zero-shot fine-grained image classification into visual question-answering using LVLMs, enhanced by attention intervention and improved class descriptions, achieving state-of-the-art performance.

DetailsMotivation: The potential of Large Vision-Language Models (LVLMs) for zero-shot fine-grained image classification remains underexplored, despite their impressive performance on vision-language reasoning tasks.

Method: Transforms zero-shot fine-grained image classification into a visual question-answering framework, leverages LVLMs’ comprehensive understanding capabilities, and enhances performance through novel attention intervention technique and improved class description benchmarks.

Result: The proposed method consistently outperforms the current state-of-the-art approach across multiple fine-grained image classification benchmarks.

Conclusion: Demonstrates both the effectiveness of the proposed method and the broader potential of LVLMs for zero-shot fine-grained classification tasks.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on vision-language reasoning tasks. However, their potential for zero-shot fine-grained image classification, a challenging task requiring precise differentiation between visually similar categories, remains underexplored. We present a novel method that transforms zero-shot fine-grained image classification into a visual question-answering framework, leveraging LVLMs’ comprehensive understanding capabilities rather than relying on direct class name generation. We enhance model performance through a novel attention intervention technique. We also address a key limitation in existing datasets by developing more comprehensive and precise class description benchmarks. We validate the effectiveness of our method through extensive experimentation across multiple fine-grained image classification benchmarks. Our proposed method consistently outperforms the current state-of-the-art (SOTA) approach, demonstrating both the effectiveness of our method and the broader potential of LVLMs for zero-shot fine-grained classification tasks. Code and Datasets: https://github.com/Atabuzzaman/Fine-grained-classification

[298] From Filters to VLMs: Benchmarking Defogging Methods through Object Detection and Segmentation Performance

Ardalan Aryashad, Parsa Razmara, Amin Mahjoub, Seyedarmin Azizi, Mahdi Salmani, Arad Firouzkouhi

Main category: cs.CV

TL;DR: A comprehensive benchmark study comparing various defogging methods for autonomous driving perception, evaluating both image quality and downstream task performance on object detection and segmentation.

DetailsMotivation: Autonomous driving perception systems are vulnerable in foggy conditions, and improvements in image fidelity from defogging don't consistently translate to better downstream detection and segmentation performance. Prior evaluations often rely on synthetic data, raising questions about real-world transferability.

Method: Structured empirical study benchmarking: (i) classical filters, (ii) modern defogging networks, (iii) chained variants (filter→model, model→filter), and (iv) prompt-driven visual-language image editing models applied directly to foggy images. Evaluation uses Foggy Cityscapes dataset.

Result: Analysis reveals when defogging helps, when chaining yields synergy or degradation, and how VLM-based editors compare to dedicated approaches. VLM judge qualitative scores show strong correlations with mAP metrics.

Conclusion: Establishes a transparent, task-oriented benchmark for defogging methods and highlights conditions under which preprocessing genuinely improves autonomous perception in adverse weather.

Abstract: Autonomous driving perception systems are particularly vulnerable in foggy conditions, where light scattering reduces contrast and obscures fine details critical for safe operation. While numerous defogging methods exist-from handcrafted filters to learned restoration models-improvements in image fidelity do not consistently translate into better downstream detection and segmentation. Moreover, prior evaluations often rely on synthetic data, leaving questions about real-world transferability. We present a structured empirical study that benchmarks a comprehensive set of pipelines, including (i) classical filters, (ii) modern defogging networks, (iii) chained variants (filter$\rightarrow$model, model$\rightarrow$filter), and (iv) prompt-driven visual–language image editing models (VLM) applied directly to foggy images. Using Foggy Cityscapes, we assess both image quality and downstream performance on object detection (mAP) and segmentation (PQ, RQ, SQ). Our analysis reveals when defogging helps, when chaining yields synergy or degradation, and how VLM-based editors compare to dedicated approaches. In addition, we evaluate qualitative rubric-based scores from a VLM judge and quantify their alignment with task metrics, showing strong correlations with mAP. Together, these results establish a transparent, task-oriented benchmark for defogging methods and highlight the conditions under which preprocessing genuinely improves autonomous perception in adverse weather.

[299] Generating Human Motion Videos using a Cascaded Text-to-Video Framework

Hyelin Nam, Hyojun Go, Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung

Main category: cs.CV

TL;DR: CAMEO is a cascaded framework that bridges Text-to-Motion models and conditional Video Diffusion Models for general human motion video generation, addressing alignment issues and introducing camera-aware conditioning.

DetailsMotivation: Human video generation has broad applications but current video diffusion models are underexplored for general-purpose human video generation, being mostly limited to image-to-video setups or narrow domains like dance videos.

Method: A cascaded framework that analyzes and prepares textual prompts and visual conditions to train VDMs, with a camera-aware conditioning module that automatically selects viewpoints aligned with input text to connect the two stages.

Result: The approach demonstrates effectiveness on both the MovieGen benchmark and a newly introduced benchmark for T2M-VDM combination, showing versatility across diverse use cases.

Conclusion: CAMEO successfully bridges T2M models and conditional VDMs for robust general human motion video generation, mitigating suboptimal factors through carefully designed components including camera-aware conditioning.

Abstract: Human video generation is becoming an increasingly important task with broad applications in graphics, entertainment, and embodied AI. Despite the rapid progress of video diffusion models (VDMs), their use for general-purpose human video generation remains underexplored, with most works constrained to image-to-video setups or narrow domains like dance videos. In this work, we propose CAMEO, a cascaded framework for general human motion video generation. It seamlessly bridges Text-to-Motion (T2M) models and conditional VDMs, mitigating suboptimal factors that may arise in this process across both training and inference through carefully designed components. Specifically, we analyze and prepare both textual prompts and visual conditions to effectively train the VDM, ensuring robust alignment between motion descriptions, conditioning signals, and the generated videos. Furthermore, we introduce a camera-aware conditioning module that connects the two stages, automatically selecting viewpoints aligned with the input text to enhance coherence and reduce manual intervention. We demonstrate the effectiveness of our approach on both the MovieGen benchmark and a newly introduced benchmark tailored to the T2M-VDM combination, while highlighting its versatility across diverse use cases.

[300] OpenFLAME: Federated Visual Positioning System to Enable Large-Scale Augmented Reality Applications

Sagar Bharadwaj, Harrison Williams, Luke Wang, Michael Liang, Tao Jin, Srinivasan Seshan, Anthony Rowe

Main category: cs.CV

TL;DR: OpenFLAME is a federated Visual Positioning System (VPS) backend that enables distributed 6DoF localization for AR applications across multiple organizations’ spaces without centralized 3D scanning.

DetailsMotivation: Current centralized VPS solutions from large companies fail to cover private indoor spaces due to privacy concerns, regulations, and maintenance bottlenecks, limiting AR application coverage.

Method: Proposes federated image-based localization where independent organizations maintain separate VPS services for their own spaces, with solutions for managing and merging data across maps without sharing private data.

Result: Enables access control of indoor 3D scans, distributed VPS maintenance, and encourages larger coverage while addressing challenges like localization coherency, quality control, and service selection.

Conclusion: OpenFLAME provides a scalable federated approach to VPS that overcomes limitations of centralized systems by allowing distributed maintenance and coverage expansion while preserving privacy.

Abstract: World-scale augmented reality (AR) applications need a ubiquitous 6DoF localization backend to anchor content to the real world consistently across devices. Large organizations such as Google and Niantic are 3D scanning outdoor public spaces in order to build their own Visual Positioning Systems (VPS). These centralized VPS solutions fail to meet the needs of many future AR applications – they do not cover private indoor spaces because of privacy concerns, regulations, and the labor bottleneck of updating and maintaining 3D scans. In this paper, we present OpenFLAME, a federated VPS backend that allows independent organizations to 3D scan and maintain a separate VPS service for their own spaces. This enables access control of indoor 3D scans, distributed maintenance of the VPS backend, and encourages larger coverage. Sharding of VPS services introduces several unique challenges – coherency of localization results across spaces, quality control of VPS services, selection of the right VPS service for a location, and many others. We introduce the concept of federated image-based localization and provide reference solutions for managing and merging data across maps without sharing private data.

[301] Talking Tennis: Language Feedback from 3D Biomechanical Action Recognition

Arushi Dashore, Aryan Anumala, Emily Hui, Olivia Yang

Main category: cs.CV

TL;DR: A framework that combines biomechanical motion analysis with deep learning and LLMs to generate actionable feedback for tennis players and coaches, bridging the gap between technical analysis and practical guidance.

DetailsMotivation: Existing tennis stroke analysis systems fail to connect biomechanical insights with accessible, meaningful feedback for players and coaches, creating a gap between technical analysis and practical application.

Method: Uses CNN-LSTM models to extract biomechanical features (joint angles, limb velocities, kinetic chain patterns) from motion data, then analyzes these features for stroke effectiveness and injury risk, and generates feedback using large language models.

Result: The framework produces technically accurate, biomechanically grounded, and actionable feedback for end-users, evaluated on classification performance and interpretability.

Conclusion: The approach successfully bridges explainable AI and sports biomechanics by connecting deep learning analysis with practical, language-based feedback for tennis stroke improvement.

Abstract: Automated tennis stroke analysis has advanced significantly with the integration of biomechanical motion cues alongside deep learning techniques, enhancing stroke classification accuracy and player performance evaluation. Despite these advancements, existing systems often fail to connect biomechanical insights with actionable language feedback that is both accessible and meaningful to players and coaches. This research project addresses this gap by developing a novel framework that extracts key biomechanical features (such as joint angles, limb velocities, and kinetic chain patterns) from motion data using Convolutional Neural Network Long Short-Term Memory (CNN-LSTM)-based models. These features are analyzed for relationships influencing stroke effectiveness and injury risk, forming the basis for feedback generation using large language models (LLMs). Leveraging the THETIS dataset and feature extraction techniques, our approach aims to produce feedback that is technically accurate, biomechanically grounded, and actionable for end-users. The experimental setup evaluates this framework on classification performance and interpretability, bridging the gap between explainable AI and sports biomechanics.

[302] Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs

Sameep Vani, Shreyas Jena, Maitreya Patel, Chitta Baral, Somak Aditya, Yezhou Yang

Main category: cs.CV

TL;DR: TimeWarp creates synthetic temporal datasets to improve Video-LLMs’ fine-grained temporal understanding, addressing their current limitation of relying too much on language reasoning rather than video dynamics.

DetailsMotivation: Video-LLMs underperform on tasks requiring fine-grained temporal understanding due to lack of visual complexity and temporal nuance in current datasets, causing them to rely heavily on language-based reasoning instead of understanding video dynamics.

Method: Proposed TimeWarp - a systematic method to create targeted synthetic temporal datasets for fine-tuning, and introduced a large-scale preference dataset capturing intricate temporal dynamics to ground model responses to visual and temporal information.

Result: Applied to existing models, significantly improved performance on temporal understanding benchmarks with absolute improvement across seven benchmarks.

Conclusion: TimeWarp’s synthetic temporal datasets effectively advance temporal understanding in Video-LLMs, demonstrating the importance of targeted dataset creation for improving fine-grained video understanding capabilities.

Abstract: While Video Large Language Models (Video-LLMs) have demonstrated remarkable performance across general video understanding benchmarks-particularly in video captioning and descriptive tasks-they consistently underperform on tasks that require fine-grained temporal understanding. This limitation arises due to the lack of visual complexity and temporal nuance in current fine-tuning datasets, leading these models to rely heavily on language-based reasoning rather than truly understanding video dynamics. In this work, we propose TimeWarp, a systematic method to create a targeted synthetic temporal dataset to fine-tune the model’s responses to encourage it to focus on the given input video. We introduce a large-scale preference dataset, created using TimeWarp, that captures intricate temporal dynamics often overlooked, grounding the model’s responses to visual and temporal information. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks, highlighting the effectiveness of our proposed datasets in advancing temporal understanding in Video-LLMs, resulting in an absolute improvement in performance across seven benchmarks. Code is available at https://github.com/sameepv21/timewarp.

[303] No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

Min Woo Sun, Alejandro Lozano, Javier Gamazo Tejero, Vishwesh Nath, Xiao Xiao Sun, James Burgess, Yuhui Zhang, Kun Yuan, Robert Tibshirani, Sean Huver, Serena Yeung-Levy

Main category: cs.CV

TL;DR: The paper introduces BMC-LongCLIP, a biomedical vision-language model with extended context length (512 tokens) to handle long-format captions, reducing token waste from 55% to 2.2% and achieving significant performance gains in retrieval and classification tasks.

DetailsMotivation: Standard VLMs use short text windows (<77 tokens), forcing truncation of long biomedical captions, but analysis shows many biomedical captions exceed this limit, suggesting potential benefits from longer context modeling.

Method: Extended text encoder context length in VLMs to 512 tokens, pretrained on BIOMEDICA-LongCAP dataset containing 1M image-caption pairs with context-aware descriptions from full-text articles.

Result: BMC-LongCLIP achieved up to +30% absolute gains in Recall@1 for long-caption retrieval, +2% average improvements in classification, faster convergence, and reduced token waste from 55% to 2.2%.

Conclusion: Long-context modeling is a promising direction for advancing biomedical VLMs, as longer context enables additional supervision from long-format captions and correlates with better performance.

Abstract: Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet, the distribution of biomedical captions from large-scale open source literature reveals that a huge portion of captions far exceed 77 tokens. To this end, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context (thus, enabling additional supervision provided in long-format captions) correlates with better retrieval and classification performance. Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and additional textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a text encoder supporting windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in Recall@1 and +2% average improvements in classification, while also converging faster than short-context. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.

[304] Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning

Yaxin Hou, Bo Han, Yuheng Jia, Hui Liu, Junhui Hou

Main category: cs.CV

TL;DR: CPG is a framework for long-tailed semi-supervised learning that handles unknown unlabeled data distributions by dynamically generating pseudo-labels and maintaining a known labeled data distribution through controllable filtering.

DetailsMotivation: Existing methods assume unlabeled data follows predefined distributions, but in reality, unlabeled data distribution is unknown and arbitrary, creating challenges for reliable pseudo-label generation.

Method: Uses a controllable self-reinforcing optimization cycle: dynamic filtering of pseudo-labels to maintain known distribution, Bayes-optimal classifier with logit adjustment, and class-aware adaptive augmentation for minority classes.

Result: Achieves consistent improvements across benchmark datasets, surpassing state-of-the-art methods by up to 15.97% in accuracy.

Conclusion: CPG effectively handles unknown unlabeled data distributions through its controllable pseudo-label generation framework and theoretically reduces generalization error.

Abstract: Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to \textbf{15.97%} in accuracy. The code is available at https://github.com/yaxinhou/CPG.

[305] Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5

Minh Hoang Nguyen, Su Nguyen Thiet

Main category: cs.CV

TL;DR: Fine-tuning PaddleOCRv5 improves Classical Chinese (Han-Nom) text recognition accuracy from 37.5% to 50.0% on degraded historical Vietnamese documents.

DetailsMotivation: Existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations in ancient Vietnamese Chinese manuscripts, hindering digitization and cross-lingual semantic research.

Method: Fine-tuned PaddleOCRv5’s text recognition module using curated Han-Nom manuscripts, with full training pipeline including preprocessing, LMDB conversion, evaluation, and visualization.

Result: Significant improvement in character recognition accuracy from 37.5% to 50.0%, especially under noisy image conditions. Developed interactive demo for visual comparison.

Conclusion: The fine-tuned model enables better digitization of historical documents and supports downstream applications like Han-Vietnamese semantic alignment, machine translation, and historical linguistics research.

Abstract: Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5.

[306] Fit Pixels, Get Labels: Meta-learned Implicit Networks for Image Segmentation

Kushal Vyas, Ashok Veeraraghavan, Guha Balakrishnan

Main category: cs.CV

TL;DR: MetaSeg is a meta-learning framework that trains implicit neural representations (INRs) for medical image segmentation, achieving comparable performance to U-Net models with 90% fewer parameters.

DetailsMotivation: INRs are effective for compact signal representations but not naturally suitable for predictive tasks like segmentation that require learning semantic structures across signal distributions.

Method: Uses an INR that simultaneously predicts pixel intensities and class labels, with meta-learning to find optimal initial parameters over training data, enabling quick fine-tuning for unseen test images.

Result: Achieved Dice scores comparable to U-Net models on 2D and 3D brain MRI segmentation tasks while using 90% fewer parameters.

Conclusion: MetaSeg provides a scalable alternative to resource-heavy architectures like U-Nets and vision transformers for medical image segmentation.

Abstract: Implicit neural representations (INRs) have achieved remarkable successes in learning expressive yet compact signal representations. However, they are not naturally amenable to predictive tasks such as segmentation, where they must learn semantic structures over a distribution of signals. In this study, we introduce MetaSeg, a meta-learning framework to train INRs for medical image segmentation. MetaSeg uses an underlying INR that simultaneously predicts per pixel intensity values and class labels. It then uses a meta-learning procedure to find optimal initial parameters for this INR over a training dataset of images and segmentation maps, such that the INR can simply be fine-tuned to fit pixels of an unseen test image, and automatically decode its class labels. We evaluated MetaSeg on 2D and 3D brain MRI segmentation tasks and report Dice scores comparable to commonly used U-Net models, but with $90%$ fewer parameters. MetaSeg offers a fresh, scalable alternative to traditional resource-heavy architectures such as U-Nets and vision transformers for medical image segmentation. Our project is available at https://kushalvyas.github.io/metaseg.html .

[307] Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning

Chendong Wang, Donglin Bai, Yifan Yang, Xiao Jin, Anlan Zhang, Rui Wang, Shiqi Jiang, Yuqing Yang, Hao Wu, Qi Dai, Chong Luo, Ting Cao, Lili Qiu, Suman Banerjee

Main category: cs.CV

TL;DR: ViTL is a two-stage framework for long-video QA that localizes relevant intervals with low-fps skimming and then answers by reallocating visual tokens at higher frame rates, achieving better performance with fewer frames.

DetailsMotivation: To address the computational challenges of long-video QA while maintaining interpretability and efficiency.

Method: Two-stage approach: 1) Localize question-relevant intervals using low-fps skim, 2) Answer via span-aware reallocation of visual tokens at higher effective frame rate with interleaved group-relative objective.

Result: Achieves up to 8.6% improvement with 50% less frame input on long-video QA and temporal grounding tasks; span-aware token reallocation consistently outperforms uniform sampling.

Conclusion: ViTL provides an interpretable, compute-efficient solution for scalable long-video QA when combined with the new dataset.

Abstract: We present \emph{Video-in-the-Loop} (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first \emph{localizing} question-relevant interval(s) with a low-fps skim and then \emph{answering} via span-aware reallocation of visual tokens at higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce \dataname{}, which converts description based event graphs into \emph{span-grounded} multiple-choice QA by pairing each question with \emph{ground-truth} time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions) and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.

[308] Prompt-to-Prompt: Text-Based Image Editing Via Cross-Attention Mechanisms – The Research of Hyperparameters and Novel Mechanisms to Enhance Existing Frameworks

Linn Bieske, Carla Lorente

Main category: cs.CV

TL;DR: This paper enhances precision in prompt-to-prompt image editing by optimizing hyperparameters and developing new methods to address variability and inconsistency issues in stable diffusion models.

DetailsMotivation: To improve the precision and reliability of prompt-to-prompt image editing frameworks by addressing variability in results and inconsistencies introduced by current deep learning methods using cross-attention mechanisms.

Method: Conducted comprehensive study of ‘word swap’ method, developed ‘attention re-weight method’ for better adaptability, and proposed ‘CL P2P’ framework to address limitations like cycle inconsistency.

Result: Enhanced understanding of hyperparameter interactions with neural network attention mechanisms, leading to improved composition and quality of generated images in prompt-to-prompt editing.

Conclusion: The research contributes to better understanding and optimization of hyperparameter settings in relation to attention mechanisms, significantly improving the precision and reliability of text-driven image editing frameworks.

Abstract: Recent advances in image editing have shifted from manual pixel manipulation to employing deep learning methods like stable diffusion models, which now leverage cross-attention mechanisms for text-driven control. This transition has simplified the editing process but also introduced variability in results, such as inconsistent hair color changes. Our research aims to enhance the precision and reliability of prompt-to-prompt image editing frameworks by exploring and optimizing hyperparameters. We present a comprehensive study of the “word swap” method, develop an “attention re-weight method” for better adaptability, and propose the “CL P2P” framework to address existing limitations like cycle inconsistency. This work contributes to understanding and improving the interaction between hyperparameter settings and the architectural choices of neural network models, specifically their attention mechanisms, which significantly influence the composition and quality of the generated images.

[309] \textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding

Bin Lei, Nuo Xu, Ali Payani, Mingyi Hong, Chunhua Liao, Yu Cao, Caiwen Ding

Main category: cs.CV

TL;DR: GUI-Spotlight is a multimodal model that improves visual grounding accuracy for GUI systems by dynamically invoking specialized tools to iteratively narrow focus on relevant screen regions, achieving state-of-the-art performance with minimal training data.

DetailsMotivation: Current MLLMs for GUI systems have limited practical usefulness due to unreliable visual grounding, which prevents accurate pointer-level actions like clicking or dragging.

Method: Train a model for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow focus to relevant screen regions.

Result: Achieves 52.8% accuracy on ScreenSpot-Pro benchmark with only 18.5K training samples, outperforming V2P-7B (50.6% with 9.6M samples) and GTA-1-7B (50.1% with 1.56M samples).

Conclusion: GUI-Spotlight substantially improves visual grounding accuracy for GUI systems through iterative region focusing, enabling more reliable pointer-level actions with minimal training data.

Abstract: Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight – a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).

[310] Quantization Range Estimation for Convolutional Neural Networks

Bingtao Yang, Yujia Wang, Mengzhi Jiao, Hongwei Huo

Main category: cs.CV

TL;DR: Proposes a range estimation method for post-training quantization that minimizes quantization errors through layer-wise local minima optimization, achieving state-of-the-art accuracy in 4-8 bit quantization for image classification models.

DetailsMotivation: Low-bit quantization while maintaining model accuracy is challenging in post-training quantization for reducing deep neural network storage.

Method: Models range estimation as an optimization problem minimizing quantization errors by layer-wise local minima, proves local convexity, and presents an efficient search algorithm applied to transformed weights space.

Result: Outperforms state-of-the-art on top-1 accuracy for ResNet series and Inception-v3 models, with almost no loss in 8-bit/6-bit settings and significant improvement in 4-bit quantization for image classification.

Conclusion: The proposed range estimation method effectively improves quantization performance in post-training quantization with minimal accuracy loss across various bit-widths.

Abstract: Post-training quantization for reducing the storage of deep neural network models has been demonstrated to be an effective way in various tasks. However, low-bit quantization while maintaining model accuracy is a challenging problem. In this paper, we present a range estimation method to improve the quantization performance for post-training quantization. We model the range estimation into an optimization problem of minimizing quantization errors by layer-wise local minima. We prove this problem is locally convex and present an efficient search algorithm to find the optimal solution. We propose the application of the above search algorithm to the transformed weights space to do further improvement in practice. Our experiments demonstrate that our method outperforms state-of-the-art performance generally on top-1 accuracy for image classification tasks on the ResNet series models and Inception-v3 model. The experimental results show that the proposed method has almost no loss of top-1 accuracy in 8-bit and 6-bit settings for image classifications, and the accuracy of 4-bit quantization is also significantly improved. The code is available at https://github.com/codeiscommitting/REQuant.

[311] MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation

Zhenyu Pan, Yucheng Lu, Han Liu

Main category: cs.CV

TL;DR: MetaFind is a tri-modal compositional retrieval framework for 3D asset retrieval in metaverse scene generation, addressing inconsistent asset retrieval and lack of standardized 3D retrieval paradigms.

DetailsMotivation: To solve inconsistent 3D asset retrieval that ignores spatial, semantic, and stylistic constraints, and the absence of standardized retrieval methods specifically designed for 3D assets.

Method: Uses flexible tri-modal retrieval with text, image, and 3D queries, featuring ESSGNN layout encoder that captures spatial relationships and appearance features, supporting iterative scene construction.

Result: Empirical evaluations show improved spatial and stylistic consistency compared to baseline methods in various retrieval tasks.

Conclusion: MetaFind provides an effective framework for contextually and stylistically coherent 3D asset retrieval in metaverse scene generation.

Abstract: We present MetaFind, a scene-aware tri-modal compositional retrieval framework designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories. MetaFind addresses two core challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic, and stylistic constraints, and (ii) the absence of a standardized retrieval paradigm specifically tailored for 3D asset retrieval, as existing approaches mainly rely on general-purpose 3D shape representation models. Our key innovation is a flexible retrieval mechanism that supports arbitrary combinations of text, image, and 3D modalities as queries, enhancing spatial reasoning and style consistency by jointly modeling object-level features (including appearance) and scene-level layout structures. Methodologically, MetaFind introduces a plug-and-play equivariant layout encoder ESSGNN that captures spatial relationships and object appearance features, ensuring retrieved 3D assets are contextually and stylistically coherent with the existing scene, regardless of coordinate frame transformations. The framework supports iterative scene construction by continuously adapting retrieval results to current scene updates. Empirical evaluations demonstrate the improved spatial and stylistic consistency of MetaFind in various retrieval tasks compared to baseline methods.

[312] Ordinal Encoding as a Regularizer in Binary Loss for Solar Flare Prediction

Chetraj Pandey, Jinsu Hong, Anli Ji, Rafal A. Angryk, Berkay Aydin

Main category: cs.CV

TL;DR: Proposes an ordinality-aware loss function that integrates ordinal relationships between solar flare sub-classes into binary classification, penalizing misclassifications near the threshold more heavily.

DetailsMotivation: Binary classification for solar flare prediction ignores ordinal relationships between sub-classes, and models struggle most with events near the prediction threshold.

Method: Modified loss function that integrates ordinal information into binary cross-entropy loss, serving as an ordinality-aware regularization method during model optimization.

Result: The approach aims to improve model performance by leveraging ordinal characteristics and penalizing threshold-near misclassifications more heavily.

Conclusion: Incorporating ordinal weighting into loss functions can enhance solar flare prediction models by better handling the inherent ordinal relationships in flare intensity data.

Abstract: The prediction of solar flares is typically formulated as a binary classification task, distinguishing events as either Flare (FL) or No-Flare (NF) according to a specified threshold (for example, greater than or equal to C-class, M-class, or X-class). However, this binary framework neglects the inherent ordinal relationships among the sub-classes contained within each category (FL and NF). Several studies on solar flare prediction have empirically shown that the most frequent misclassifications occur near this prediction threshold. This suggests that the models struggle to differentiate events that are similar in intensity but fall on opposite sides of the binary threshold. To mitigate this limitation, we propose a modified loss function that integrates the ordinal information among the sub-classes of the binarized flare labels into the conventional binary cross-entropy (BCE) loss. This approach serves as an ordinality-aware, data-driven regularization method that penalizes the incorrect predictions of flare events in close proximity to the prediction threshold more heavily than those away from the boundary during model optimization. By incorporating ordinal weighting into the loss function, we aim to enhance the model’s learning process by leveraging the ordinal characteristics of the data, thereby improving its overall performance.

[313] QuantDemoire: Quantization with Outlier Aware for Image Demoiréing

Zheng Chen, Kewei Zhang, Xiaoyang Liu, Weihang Zhang, Mengfan Wang, Yifan Fu, Yulun Zhang

Main category: cs.CV

TL;DR: QuantDemoire is a post-training quantization framework for demoiréing models that addresses performance degradation issues through outlier-aware quantization and frequency-aware calibration, achieving significant computational savings while maintaining quality.

DetailsMotivation: Existing deep learning-based demoiréing methods require substantial computational resources, limiting deployment on edge devices. Direct application of standard quantization methods causes severe performance degradation due to distribution outliers and weakened representations in smooth regions.

Method: Proposes two key components: (1) Outlier-aware quantizer using sampling-based range estimation to reduce activation outliers and keeping extreme weights in FP16, (2) Frequency-aware calibration strategy that emphasizes low- and mid-frequency components during fine-tuning to mitigate banding artifacts.

Result: QuantDemoire achieves large reductions in parameters and computation while maintaining quality, outperforming existing quantization methods by over 4 dB on W4A4 (4-bit weight and activation quantization).

Conclusion: The proposed QuantDemoire framework effectively addresses quantization challenges for demoiréing models, enabling efficient deployment on edge devices without compromising performance.

Abstract: Demoir'eing aims to remove moir'e artifacts that often occur in images. While recent deep learning-based methods have achieved promising results, they typically require substantial computational resources, limiting their deployment on edge devices. Model quantization offers a compelling solution. However, directly applying existing quantization methods to demoir'eing models introduces severe performance degradation. The main reasons are distribution outliers and weakened representations in smooth regions. To address these issues, we propose QuantDemoire, a post-training quantization framework tailored to demoir'eing. It contains two key components. **First}, we introduce an outlier-aware quantizer to reduce errors from outliers. It uses sampling-based range estimation to reduce activation outliers, and keeps a few extreme weights in FP16 with negligible cost. Second, we design a frequency-aware calibration strategy. It emphasizes low- and mid-frequency components during fine-tuning, which mitigates banding artifacts caused by low-bit quantization. Extensive experiments validate that our QuantDemoire achieves large reductions in parameters and computation while maintaining quality. Meanwhile, it outperforms existing quantization methods by over 4 dB on W4A4. Code is released at: https://github.com/zhengchen1999/QuantDemoire.

[314] Diffusion Low Rank Hybrid Reconstruction for Sparse View Medical Imaging

Zongyin Deng, Qing Zhou, Yuhao Fang, Zijian Wang, Yao Lu, Ye Zhang, Chun Li

Main category: cs.CV

TL;DR: TV-LoRA is a novel method for low-dose sparse-view CT reconstruction that combines diffusion generative priors with multi-regularization constraints (anisotropic TV and nuclear norm) in an ADMM framework, achieving superior performance in texture recovery and artifact suppression.

DetailsMotivation: To address the ill-posedness and texture loss problems in extremely sparse-view CT reconstruction under low-dose conditions, where traditional methods struggle to maintain image quality.

Method: Combines diffusion generative prior (NCSN++ with SDE modeling) with multi-regularization constraints including anisotropic TV and nuclear norm (LoRA) in an ADMM framework. Uses 2D slice-based strategy with FFT acceleration and tensor-parallel optimization for efficient inference.

Result: Experiments on AAPM-2016, CTHD, and LIDC datasets with 8, 4, and 2 views show TV-LoRA consistently surpasses benchmarks in SSIM, texture recovery, edge clarity, and artifact suppression, demonstrating strong robustness and generalizability.

Conclusion: TV-LoRA achieves high-fidelity, efficient 3D CT reconstruction with broad clinical applicability in low-dose, sparse-sampling scenarios, with ablation studies confirming the complementary effects of LoRA regularization and diffusion priors.

Abstract: This work presents TV-LoRA, a novel method for low-dose sparse-view CT reconstruction that combines a diffusion generative prior (NCSN++ with SDE modeling) and multi-regularization constraints, including anisotropic TV and nuclear norm (LoRA), within an ADMM framework. To address ill-posedness and texture loss under extremely sparse views, TV-LoRA integrates generative and physical constraints, and utilizes a 2D slice-based strategy with FFT acceleration and tensor-parallel optimization for efficient inference. Experiments on AAPM-2016, CTHD, and LIDC datasets with $N_{\mathrm{view}}=8,4,2$ show that TV-LoRA consistently surpasses benchmarks in SSIM, texture recovery, edge clarity, and artifact suppression, demonstrating strong robustness and generalizability. Ablation studies confirm the complementary effects of LoRA regularization and diffusion priors, while the FFT-PCG module provides a speedup. Overall, Diffusion + TV-LoRA achieves high-fidelity, efficient 3D CT reconstruction and broad clinical applicability in low-dose, sparse-sampling scenarios.

[315] TOPO-Bench: An Open-Source Topological Mapping Evaluation Framework with Quantifiable Perceptual Aliasing

Jiaming Wang, Diwen Liu, Jizhuo Chen, Harold Soh

Main category: cs.CV

TL;DR: This paper addresses the lack of standardized evaluation in topological mapping by proposing topological consistency as a fundamental metric, introducing a quantitative measure for dataset ambiguity, and releasing a benchmark dataset with calibrated ambiguity levels.

DetailsMotivation: Progress in topological mapping is hindered by the absence of standardized evaluation metrics, datasets, and protocols, preventing fair comparisons between systems. Perceptual aliasing remains under-quantified despite its significant impact on performance.

Method: The authors formalize topological consistency as the core property of topological maps and show localization accuracy serves as an efficient surrogate metric. They propose the first quantitative measure of dataset ambiguity and curate a diverse benchmark dataset with calibrated ambiguity levels.

Result: The paper implements and releases deep-learned baseline systems and evaluates them alongside classical methods, yielding new insights into the limitations of current approaches under perceptual aliasing conditions.

Conclusion: All datasets, baselines, and evaluation tools are open-sourced to promote consistent and reproducible research in topological mapping, addressing the field’s standardization challenges.

Abstract: Topological mapping offers a compact and robust representation for navigation, but progress in the field is hindered by the lack of standardized evaluation metrics, datasets, and protocols. Existing systems are assessed using different environments and criteria, preventing fair and reproducible comparisons. Moreover, a key challenge - perceptual aliasing - remains under-quantified, despite its strong influence on system performance. We address these gaps by (1) formalizing topological consistency as the fundamental property of topological maps and showing that localization accuracy provides an efficient and interpretable surrogate metric, and (2) proposing the first quantitative measure of dataset ambiguity to enable fair comparisons across environments. To support this protocol, we curate a diverse benchmark dataset with calibrated ambiguity levels, implement and release deep-learned baseline systems, and evaluate them alongside classical methods. Our experiments and analysis yield new insights into the limitations of current approaches under perceptual aliasing. All datasets, baselines, and evaluation tools are fully open-sourced to foster consistent and reproducible research in topological mapping.

[316] Learning Efficient Meshflow and Optical Flow from Event Cameras

Xinglong Luo, Ao Luo, Kunming Luo, Zhengning Wang, Ping Tan, Bing Zeng, Shuaicheng Liu

Main category: cs.CV

TL;DR: This paper introduces event-based meshflow estimation, creates the HREM/HREM+ datasets, and proposes EEMFlow/EEMFlow+ networks with density adaptation for efficient motion field prediction from event cameras.

DetailsMotivation: Address two key gaps in event-based flow estimation: lack of meshflow-specific datasets/methods and underexplored event data density challenges.

Method: Created HREM dataset (1280x720 resolution), proposed EEMFlow network with encoder-decoder architecture, added CDC module for dense flow, and developed ADM for density adaptation.

Result: EEMFlow achieves 30x faster runtime than SOTA methods. ADM improves EEMFlow and EEMFlow+ performance by 8% and 10% respectively.

Conclusion: The proposed methods successfully address meshflow estimation and data density challenges, demonstrating superior efficiency and performance in event-based motion field prediction.

Abstract: In this paper, we explore the problem of event-based meshflow estimation, a novel task that involves predicting a spatially smooth sparse motion field from event cameras. To start, we review the state-of-the-art in event-based flow estimation, highlighting two key areas for further research: i) the lack of meshflow-specific event datasets and methods, and ii) the underexplored challenge of event data density. First, we generate a large-scale High-Resolution Event Meshflow (HREM) dataset, which showcases its superiority by encompassing the merits of high resolution at 1280x720, handling dynamic objects and complex motion patterns, and offering both optical flow and meshflow labels. These aspects have not been fully explored in previous works. Besides, we propose Efficient Event-based MeshFlow (EEMFlow) network, a lightweight model featuring a specially crafted encoder-decoder architecture to facilitate swift and accurate meshflow estimation. Furthermore, we upgrade EEMFlow network to support dense event optical flow, in which a Confidence-induced Detail Completion (CDC) module is proposed to preserve sharp motion boundaries. We conduct comprehensive experiments to show the exceptional performance and runtime efficiency (30x faster) of our EEMFlow model compared to the recent state-of-the-art flow method. As an extension, we expand HREM into HREM+, a multi-density event dataset contributing to a thorough study of the robustness of existing methods across data with varying densities, and propose an Adaptive Density Module (ADM) to adjust the density of input event data to a more optimal range, enhancing the model’s generalization ability. We empirically demonstrate that ADM helps to significantly improve the performance of EEMFlow and EEMFlow+ by 8% and 10%, respectively. Code and dataset are released at https://github.com/boomluo02/EEMFlowPlus.

[317] Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation

Seunghyun Lee, Tae-Kyun Kim

Main category: cs.CV

TL;DR: A novel diffusion-based pipeline for category-level 6D object pose estimation that accelerates training convergence and eliminates the need for additional evaluation networks through pretrained encoders and time-dependent sampling guidance.

DetailsMotivation: Existing diffusion models for 6D pose estimation suffer from slow training convergence, end-to-end encoder learning with diffusion networks, and require additional networks to filter low-quality pose candidates.

Method: Pretrains encoder with direct pose regression head, jointly learns networks via regression and diffusion heads, and introduces sampling guidance with time-dependent score scaling to balance exploration-exploitation trade-off.

Result: Achieves state-of-the-art accuracies on REAL275, HouseCat6D, and ROPE benchmarks with single-pose inference, while being more efficient in both training and inference.

Conclusion: The proposed method is simple yet effective, accelerating training convergence while maintaining multi-modal characteristics for symmetric objects and ensuring high-quality pose generation.

Abstract: Latest diffusion models have shown promising results in category-level 6D object pose estimation by modeling the conditional pose distribution with depth image input. The existing methods, however, suffer from slow convergence during training, learning its encoder with the diffusion denoising network in end-to-end fashion, and require an additional network that evaluates sampled pose hypotheses to filter out low-quality pose candidates. In this paper, we propose a novel pipeline that tackles these limitations by two key components. First, the proposed method pretrains the encoder with the direct pose regression head, and jointly learns the networks via the regression head and the denoising diffusion head, significantly accelerating training convergence while achieving higher accuracy. Second, sampling guidance via time-dependent score scaling is proposed s.t. the exploration-exploitation trade-off is effectively taken, eliminating the need for the additional evaluation network. The sampling guidance maintains multi-modal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at final steps. Extensive experiments on multiple benchmarks including REAL275, HouseCat6D, and ROPE, demonstrate that the proposed method, simple yet effective, achieves state-of-the-art accuracies even with single-pose inference, while being more efficient in both training and inference.

[318] Learning from All: Concept Alignment for Autonomous Distillation from Multiple Drifting MLLMs

Xiaoyu Yang, Jie Lu, En Yu

Main category: cs.CV

TL;DR: This paper addresses concept drift in multimodal large language model (MLLM) distillation, where multiple teachers’ reasoning trajectories drift unpredictably, causing bias transmission to student models. The authors propose autonomous preference optimization (APO) using a “learn, compare, critique” paradigm to align concepts and improve model robustness.

DetailsMotivation: The paper identifies that reasoning trajectories from multiple MLLM teachers exhibit concept drift, where their reasoning distributions evolve unpredictably and transmit biases to student models, compromising performance in knowledge distillation.

Method: The authors establish a theoretical connection between concept drift and knowledge distillation, framing non-stationary reasoning dynamics as next-token prediction of multi-stream reasoning trajectories. They introduce the “learn, compare, critique” paradigm with autonomous preference optimization (APO), where students learn from teachers, compare multiple teachers, and critically reflect on drifting inferences to perform concept alignment.

Result: Extensive experiments demonstrate superior performance in consistency, robustness, and generalization within knowledge distillation. The authors also contribute CXR-MAX, a large-scale dataset with 170,982 distilled reasoning trajectories derived from publicly accessible MLLMs based on MIMIC-CXR.

Conclusion: The proposed autonomous preference optimization approach effectively addresses concept drift in MLLM distillation, yielding robust, consistent, and generalizable student models through the “learn, compare, critique” paradigm and concept alignment.

Abstract: This paper identifies a critical yet underexplored challenge in distilling from multimodal large language models (MLLMs): the reasoning trajectories generated by multiple drifting teachers exhibit concept drift, whereby their reasoning distributions evolve unpredictably and transmit biases to the student model, ultimately compromising its performance. To tackle this issue, we pioneer a theoretical connection between concept drift and knowledge distillation, casting the non-stationary reasoning dynamics from multiple MLLM teachers as next-token prediction of multi-stream reasoning trajectories.Guided by concept drift, we introduce the “learn, compare, critique” paradigm, culminating in autonomous preference optimization (APO). Under the active guidance of the teachers, the student model first learns and self-distils preferred thinking by comparing multiple teachers. It then engages in critical reflection over the drifting inference from teachers, performing concept alignment through APO, ultimately yielding a robust, consistent, and generalizable model.Extensive experiments demonstrate our superior performance of consistency, robustness and generalization within knowledge distillation. Besides, we also contributed a large-scale dataset, CXR-MAX (Multi-teachers Alignment X-rays), comprising 170,982 distilled reasoning trajectories derived from publicly accessible MLLMs based on MIMIC-CXR. Our code and data are public at: https://anonymous.4open.science/r/Autonomous-Distillation/.

[319] Automating construction safety inspections using a multi-modal vision-language RAG framework

Chenxin Wang, Elyas Asadi Shamsabadi, Zhaohui Chen, Luming Shen, Alireza Ahmadian Fard Fini, Daniel Dias-da-Costa

Main category: cs.CV

TL;DR: SiteShield is a multi-modal LVLM-based RAG framework that automates construction safety inspection reports by integrating visual and audio inputs, outperforming unimodal LLMs.

DetailsMotivation: Conventional construction safety inspection methods are inefficient due to large information volume. Existing LVLM applications face limitations like irrelevant responses, restricted modal inputs, and hallucinations, while LLMs lack training data and real-time adaptability.

Method: Developed SiteShield, a multi-modal LVLM-based Retrieval-Augmented Generation (RAG) framework that integrates visual and audio inputs for construction safety inspection automation.

Result: SiteShield outperformed unimodal LLMs without RAG with F1 score of 0.82, hamming loss of 0.04, precision of 0.76, and recall of 0.96 using real-world data.

Conclusion: SiteShield offers a novel pathway to enhance information retrieval and efficiency in generating safety reports for construction safety inspections.

Abstract: Conventional construction safety inspection methods are often inefficient as they require navigating through large volume of information. Recent advances in large vision-language models (LVLMs) provide opportunities to automate safety inspections through enhanced visual and linguistic understanding. However, existing applications face limitations including irrelevant or unspecific responses, restricted modal inputs and hallucinations. Utilisation of Large Language Models (LLMs) for this purpose is constrained by availability of training data and frequently lack real-time adaptability. This study introduces SiteShield, a multi-modal LVLM-based Retrieval-Augmented Generation (RAG) framework for automating construction safety inspection reports by integrating visual and audio inputs. Using real-world data, SiteShield outperformed unimodal LLMs without RAG with an F1 score of 0.82, hamming loss of 0.04, precision of 0.76, and recall of 0.96. The findings indicate that SiteShield offers a novel pathway to enhance information retrieval and efficiency in generating safety reports.

[320] BLADE: Bias-Linked Adaptive DEbiasing

Piyush Arora, Navlika Singh, Vasubhya Diwan, Pratik Mazumder

Main category: cs.CV

TL;DR: BLADE is a generative debiasing framework that mitigates neural network biases without requiring prior knowledge of biases or bias-conflicting samples, using adaptive image refinement and alignment strategies.

DetailsMotivation: Neural networks often learn implicit biases and spurious correlations from training data, relying on superficial patterns rather than task-relevant features. Existing methods require impractical assumptions like prior bias knowledge or access to bias-conflicting samples.

Method: BLADE trains a generative model to translate images across bias domains while preserving task-relevant features, then adaptively refines images with synthetic counterparts based on bias susceptibility. It aligns images with bias-translated counterparts sharing task features but differing in bias, while misaligning with same-bias samples.

Result: BLADE significantly outperforms state-of-the-art methods on multiple benchmark datasets, exceeding the closest baseline by ~18% absolute margin on corrupted CIFAR-10 under worst group setting.

Conclusion: BLADE establishes a new benchmark in bias mitigation and demonstrates potential for developing more robust deep learning models without explicit supervision.

Abstract: Neural networks have revolutionized numerous fields, yet they remain vulnerable to a critical flaw: the tendency to learn implicit biases, spurious correlations between certain attributes and target labels in training data. These biases are often more prevalent and easier to learn, causing models to rely on superficial patterns rather than task-relevant features necessary for generalization. Existing methods typically rely on strong assumptions, such as prior knowledge of these biases or access to bias-conflicting samples, i.e., samples that contradict spurious correlations and counterbalance bias-aligned samples, samples that conform to these spurious correlations. However, such assumptions are often impractical in real-world settings. We propose BLADE ({B}ias-{L}inked {A}daptive {DE}biasing), a generative debiasing framework that requires no prior knowledge of bias or bias-conflicting samples. BLADE first trains a generative model to translate images across bias domains while preserving task-relevant features. Then, it adaptively refines each image with its synthetic counterpart based on the image’s susceptibility to bias. To encourage robust representations, BLADE aligns an image with its bias-translated synthetic counterpart that shares task-relevant features but differs in bias, while misaligning it with samples sharing the same bias. We evaluate BLADE on multiple benchmark datasets and show that it significantly outperforms state-of-the-art methods. Notably, it exceeds the closest baseline by an absolute margin of around 18% on the corrupted CIFAR-10 dataset under the worst group setting, establishing a new benchmark in bias mitigation and demonstrating its potential for developing more robust deep learning models without explicit supervision.

[321] From Segments to Concepts: Interpretable Image Classification via Concept-Guided Segmentation

Ran Eisenberg, Amit Rozner, Ethan Fetaya, Ofir Lindenbaum

Main category: cs.CV

TL;DR: SEG-MIL-CBM integrates concept-guided segmentation with multiple instance learning to provide spatially grounded concept explanations without requiring concept annotations, improving interpretability and robustness.

DetailsMotivation: Deep neural networks lack interpretability and may exploit misleading features, limiting trust in safety-critical applications. Existing Concept Bottleneck Models require costly concept annotations and lack spatial grounding.

Method: Combines concept-guided image segmentation with attention-based multiple instance learning, treating segmented regions as instances and aggregating evidence across them to identify task-relevant concepts.

Result: Achieves robust performance across spurious correlations, input corruptions, and large-scale benchmarks while providing transparent concept-level explanations without concept annotations.

Conclusion: SEG-MIL-CBM enables spatially grounded concept reasoning without manual annotations, improving both interpretability and robustness in deep learning models.

Abstract: Deep neural networks have achieved remarkable success in computer vision; however, their black-box nature in decision-making limits interpretability and trust, particularly in safety-critical applications. Interpretability is crucial in domains where errors have severe consequences. Existing models not only lack transparency but also risk exploiting unreliable or misleading features, which undermines both robustness and the validity of their explanations. Concept Bottleneck Models (CBMs) aim to improve transparency by reasoning through human-interpretable concepts. Still, they require costly concept annotations and lack spatial grounding, often failing to identify which regions support each concept. We propose SEG-MIL-CBM, a novel framework that integrates concept-guided image segmentation into an attention-based multiple instance learning (MIL) framework, where each segmented region is treated as an instance and the model learns to aggregate evidence across them. By reasoning over semantically meaningful regions aligned with high-level concepts, our model highlights task-relevant evidence, down-weights irrelevant cues, and produces spatially grounded, concept-level explanations without requiring annotations of concepts or groups. SEG-MIL-CBM achieves robust performance across settings involving spurious correlations (unintended dependencies between background and label), input corruptions (perturbations that degrade visual quality), and large-scale benchmarks, while providing transparent, concept-level explanations.

[322] Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers

Shikang Zheng, Guantao Chen, Qinming Zhou, Yuqi Lin, Lixuan He, Chang Zou, Peiliang Cai, Jiacheng Liu, Linfeng Zhang

Main category: cs.CV

TL;DR: HyCa is a hybrid ODE solver-inspired caching framework that applies dimension-wise caching strategies to accelerate Diffusion Transformers, achieving 5-6x speedup across various models without retraining.

DetailsMotivation: Diffusion Transformers have state-of-the-art fidelity but suffer from slow iterative sampling due to expensive transformer forward passes at each timestep. Existing caching methods use uniform strategies that ignore heterogeneous feature dynamics.

Method: Model hidden feature evolution as a mixture of ODEs across dimensions and introduce HyCa, a hybrid ODE solver-inspired caching framework that applies dimension-wise caching strategies.

Result: Achieves near-lossless acceleration: 5.55x speedup on FLUX, 5.56x on HunyuanVideo, 6.24x on Qwen-Image and Qwen-Image-Edit without retraining.

Conclusion: HyCa provides effective training-free acceleration for Diffusion Transformers by modeling feature dynamics with dimension-wise ODEs and hybrid caching strategies.

Abstract: Diffusion Transformers offer state-of-the-art fidelity in image and video synthesis, but their iterative sampling process remains a major bottleneck due to the high cost of transformer forward passes at each timestep. To mitigate this, feature caching has emerged as a training-free acceleration technique that reuses or forecasts hidden representations. However, existing methods often apply a uniform caching strategy across all feature dimensions, ignoring their heterogeneous dynamic behaviors. Therefore, we adopt a new perspective by modeling hidden feature evolution as a mixture of ODEs across dimensions, and introduce HyCa, a Hybrid ODE solver inspired caching framework that applies dimension-wise caching strategies. HyCa achieves near-lossless acceleration across diverse domains and models, including 5.55 times speedup on FLUX, 5.56 times speedup on HunyuanVideo, 6.24 times speedup on Qwen-Image and Qwen-Image-Edit without retraining.

[323] World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge

Moo Hyun Son, Jintaek Oh, Sun Bin Mun, Jaechul Roh, Sehyun Choi

Main category: cs.CV

TL;DR: World-To-Image is a framework that enhances text-to-image generation by using web-searching agents to retrieve images for novel concepts, enabling multimodal prompt optimization and achieving significant improvements in semantic accuracy.

DetailsMotivation: Text-to-image models struggle with novel or out-of-distribution entities due to knowledge cutoffs, leading to degraded performance when prompted with unfamiliar concepts.

Method: An agent dynamically searches the web to retrieve images for unknown concepts, then performs multimodal prompt optimization to steer generative models toward accurate synthesis.

Result: World-To-Image achieves +8.1% improvement in accuracy-to-prompt on the NICE benchmark, outperforming state-of-the-art methods in both semantic alignment and visual aesthetics with high efficiency in under three iterations.

Conclusion: The framework enables T2I systems to better reflect the ever-changing real world by bridging knowledge gaps through agent-driven web retrieval and multimodal optimization.

Abstract: While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward an accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency in less than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available here\footnote{https://github.com/mhson-kyle/World-To-Image}.

[324] MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering

Lixuan He, Shikang Zheng, Linfeng Zhang

Main category: cs.CV

TL;DR: MASC is a framework that constructs a hierarchical semantic tree from token embeddings to simplify autoregressive image generation, improving training efficiency by 57% and reducing FID from 2.87 to 2.58.

DetailsMotivation: Autoregressive models for image generation are inefficient due to treating visual tokens as a flat vocabulary, ignoring the semantic structure of the embedding space, which complicates prediction and limits performance.

Method: MASC uses geometry-aware distance metrics and density-driven agglomerative construction to build a hierarchical semantic tree that models the underlying manifold of token embeddings, transforming flat prediction into structured hierarchical prediction.

Result: MASC accelerates training by up to 57% and improves generation quality, reducing FID of LlamaGen-XL from 2.87 to 2.58, making AR frameworks competitive with state-of-the-art methods.

Conclusion: Structuring the prediction space through hierarchical semantic organization is as crucial as architectural innovation for scalable generative modeling, with MASC providing a plug-and-play solution that significantly enhances AR image generation.

Abstract: Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook’s intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.

[325] Zoom-In to Sort AI-Generated Images Out

Yikun Ji, Yan Hong, Bowen Deng, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

Main category: cs.CV

TL;DR: ZoomIn is a two-stage forensic framework that improves AI-generated image detection accuracy and interpretability by first scanning for suspicious regions and then performing focused analysis on zoomed-in areas.

DetailsMotivation: The rapid growth of AI-generated imagery has blurred boundaries between real and synthetic content, raising digital integrity concerns. Current vision-language models often fail to detect subtle artifacts in high-quality synthetic images.

Method: Proposed ZoomIn framework mimics human visual inspection with two stages: scanning images to locate suspicious regions, then performing focused analysis on zoomed-in areas. Created MagniFake dataset of 20,000 real/synthetic images with bounding boxes and explanations using automated VLM-based pipeline.

Result: Achieved 96.39% accuracy with robust generalization. Provides human-understandable explanations grounded in visual evidence.

Conclusion: ZoomIn effectively addresses both detection accuracy and interpretability challenges in AI-generated image forensics through its two-stage approach and comprehensive dataset.

Abstract: The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising critical concerns for digital integrity. Vision-language models (VLMs) offer interpretability through explanations but often fail to detect subtle artifacts in high-quality synthetic images. We propose ZoomIn, a two-stage forensic framework that improves both accuracy and interpretability. Mimicking human visual inspection, ZoomIn first scans an image to locate suspicious regions and then performs a focused analysis on these zoomed-in areas to deliver a grounded verdict. To support training, we introduce MagniFake, a dataset of 20,000 real and high-quality synthetic images annotated with bounding boxes and forensic explanations, generated through an automated VLM-based pipeline. Our method achieves 96.39% accuracy with robust generalization, while providing human-understandable explanations grounded in visual evidence.

[326] A Recursive Pyramidal Algorithm for Solving the Image Registration Problem

Stefan Dirnstorfer

Main category: cs.CV

TL;DR: A simple, end-to-end trainable image registration algorithm that requires minimal code, training data, and training time while achieving accurate results.

DetailsMotivation: To create an accessible and efficient solution for image registration that works with limited resources and can be easily implemented.

Method: An end-to-end trainable algorithm implemented in a few lines of Python code, demonstrated with stereo vision using 74 images on a 19x15 input window.

Result: The algorithm achieves accurate image registration results with very little training data and training time, excelling in brevity and simplicity.

Conclusion: This simple algorithm serves as an excellent starting point for image registration tasks where training data, training time, or code complexity are constrained.

Abstract: The problem of image registration is finding a transformation that aligns two images, such that the corresponding points are in the same location. This paper introduces a simple, end-to-end trainable algorithm that is implementable in a few lines of Python code. The approach is shown to work with very little training data and training time, while achieving accurate results in some settings. An example application to stereo vision was trained from 74 images on a 19x15 input window. With just a dozen lines of Python code this algorithm excels in brevity and may serve as a good start in related scenarios with limitations to training data, training time or code complexity.

[327] Detection of retinal diseases using an accelerated reused convolutional network

Amin Ahmadi Kasani, Hedieh Sajedi

Main category: cs.CV

TL;DR: The paper introduces ArConv layers to redesign and optimize convolutional layers, creating a lightweight model with only 1.3M parameters that achieves better accuracy than MobileNetV2 on eye disease detection tasks.

DetailsMotivation: To improve accessibility of deep neural networks for eye disease detection by reducing computational complexity while maintaining high accuracy, making models suitable for mobile devices.

Method: Redesigned and optimized convolutional layers by creating novel ArConv layers, resulting in a new general model architecture with reduced parameter count.

Result: The proposed model with 1.3M parameters achieved 0.9328 accuracy on RfMiD test set, outperforming MobileNetV2 (2.2M parameters) which achieved 0.9266 accuracy under identical conditions.

Conclusion: The ArConv-based model successfully balances computational efficiency and accuracy, making deep learning more accessible for eye disease diagnosis on mobile platforms.

Abstract: Convolutional neural networks are continually evolving, with some efforts aimed at improving accuracy, others at increasing speed, and some at enhancing accessibility. Improving accessibility broadens the application of neural networks across a wider range of tasks, including the detection of eye diseases. Early diagnosis of eye diseases and consulting an ophthalmologist can prevent many vision disorders. Given the importance of this issue, various datasets have been collected from the cornea to facilitate the process of making neural network models. However, most of the methods introduced in the past are computationally complex. In this study, we tried to increase the accessibility of deep neural network models. We did this at the most fundamental level, specifically by redesigning and optimizing the convolutional layers. By doing so, we created a new general model that incorporates our novel convolutional layer named ArConv layers. Thanks to the efficient performance of this new layer, the model has suitable complexity for use in mobile phones and can perform the task of diagnosing the presence of disease with high accuracy. The final model we present contains only 1.3 million parameters. In comparison to the MobileNetV2 model, which has 2.2 million parameters, our model demonstrated better accuracy when trained and evaluated on the RfMiD dataset under identical conditions, achieving an accuracy of 0.9328 versus 0.9266 on the RfMiD test set.

[328] Scaling Sequence-to-Sequence Generative Neural Rendering

Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C. Pérez, Zijian Zhou, Chi Phung, Tao Xiang, Juan-Manuel Pérez-Rúa

Main category: cs.CV

TL;DR: Kaleido is a generative model for neural rendering that treats 3D as a specialized video domain, using sequence-to-sequence image synthesis to perform view synthesis without explicit 3D representations.

DetailsMotivation: To create a unified framework for photorealistic object- and scene-level neural rendering that can leverage large-scale video data for pre-training and reduce reliance on scarce 3D datasets.

Method: Uses a masked autoregressive framework with decoder-only rectified flow transformer to generate arbitrary 6-DoF views conditioned on reference views, unifying 3D and video modeling.

Result: Sets new state-of-the-art on view synthesis benchmarks, with zero-shot performance outperforming other generative methods in few-view settings and matching per-scene optimization methods in many-view settings.

Conclusion: Kaleido successfully demonstrates that 3D rendering can be effectively treated as a video sequence task, enabling unified modeling and leveraging video data to improve performance while reducing dependency on 3D datasets.

Abstract: We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systemic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets – all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.

[329] The best performance in the CARE 2025 – Liver Task (LiSeg-Contrast): Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation

Jincan Lou, Jingkun Chen, Haoquan Li, Hang Li, Wenjian Huang, Weihua Chen, Fan Wang, Jianguo Zhang

Main category: cs.CV

TL;DR: CoSSeg-TTA is a compact liver segmentation framework for MRI that addresses domain shifts and limited annotated data through semi-supervised learning, domain adaptation, and test-time adaptation.

DetailsMotivation: Liver segmentation from contrast-enhanced MRI is challenging due to limited annotated data, heterogeneous enhancement protocols, and domain shifts across scanners/institutions. Traditional image translation methods have limitations like requiring registration, structural distortions, and unstable training.

Method: Built on nnU-Netv2 with semi-supervised mean teacher scheme, domain adaptation module with randomized histogram-based style transfer and contrast-aware network, and continual test-time adaptation strategy.

Result: Outperforms nnU-Netv2 baseline with superior Dice score and Hausdorff Distance, showing strong generalization to unseen domains under low-annotation conditions.

Conclusion: The proposed framework effectively addresses domain generalization challenges in liver MRI segmentation through combined semi-supervised learning, domain adaptation, and test-time adaptation strategies.

Abstract: Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring. However, it remains challenging due to limited annotated data, heterogeneous enhancement protocols, and significant domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but their application is not straightforward. For example, Pix2Pix requires image registration, and cycle-GAN cannot be integrated seamlessly into segmentation pipelines. Meanwhile, these methods are originally used to deal with cross-modality scenarios, and often introduce structural distortions and suffer from unstable training, which may pose drawbacks in our single-modality scenario. To address these challenges, we propose CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality built upon nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, incorporating a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. Furthermore, a continual test-time adaptation strategy is employed to improve robustness during inference. Extensive experiments demonstrate that our framework consistently outperforms the nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance while exhibiting strong generalization to unseen domains under low-annotation conditions.

[330] Concept-Based Masking: A Patch-Agnostic Defense Against Adversarial Patch Attacks

Ayushi Mehrotra, Derek Peng, Dipkamal Bhusal, Nidhi Rastogi

Main category: cs.CV

TL;DR: A patch-agnostic defense method using concept-based explanations to neutralize adversarial patch attacks without requiring prior knowledge of patch size or location.

DetailsMotivation: Existing defenses against adversarial patch attacks typically assume prior knowledge of patch size or location, which limits their practical applicability in real-world scenarios.

Method: Leverages concept-based explanations to identify and suppress the most influential concept activation vectors, thereby neutralizing patch effects without explicit detection.

Result: Achieves higher robust and clean accuracy than state-of-the-art PatchCleanser on Imagenette with ResNet-50, while maintaining strong performance across varying patch sizes and locations.

Conclusion: Combining interpretability with robustness shows promise, and concept-driven defenses represent a scalable strategy for securing machine learning models against adversarial patch attacks.

Abstract: Adversarial patch attacks pose a practical threat to deep learning models by forcing targeted misclassifications through localized perturbations, often realized in the physical world. Existing defenses typically assume prior knowledge of patch size or location, limiting their applicability. In this work, we propose a patch-agnostic defense that leverages concept-based explanations to identify and suppress the most influential concept activation vectors, thereby neutralizing patch effects without explicit detection. Evaluated on Imagenette with a ResNet-50, our method achieves higher robust and clean accuracy than the state-of-the-art PatchCleanser, while maintaining strong performance across varying patch sizes and locations. Our results highlight the promise of combining interpretability with robustness and suggest concept-driven defenses as a scalable strategy for securing machine learning models against adversarial patch attacks.

[331] Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition

Yu Kiu, Lau, Chao Chen, Ge Jin, Chen Feng

Main category: cs.CV

TL;DR: Adapt-STformer is a flexible and efficient Seq-VPR method using Recurrent Deformable Transformer Encoder that supports variable sequence lengths, reduces inference time by 36% and memory usage by 35% while improving recall by up to 17%.

DetailsMotivation: Existing transformer-based Seq-VPR methods prioritize performance but lack flexibility for variable sequence lengths, fast inference, and low memory usage needed for real-time applications.

Method: Proposed Adapt-STformer with Recurrent Deformable Transformer Encoder (Recurrent-DTE) that uses iterative recurrent mechanism to fuse information from multiple sequential frames.

Result: Achieved up to 17% recall improvement, 36% reduction in sequence extraction time, and 35% lower memory usage compared to second-best baseline on Nordland, Oxford, and NuScenes datasets.

Conclusion: Adapt-STformer successfully addresses the flexibility-efficiency trade-off in Seq-VPR, enabling variable sequence length support with improved performance and reduced computational requirements.

Abstract: Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively; however, existing approaches prioritize performance at the expense of flexibility and efficiency. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (seq-length), deliver fast inference, and have low memory usage to meet real-time constraints. To our knowledge, no existing transformer-based Seq-VPR method achieves both flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable seq-lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% compared to the second-best baseline.

[332] ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling

Main category: cs.CV

TL;DR: ChronoEdit reframes image editing as video generation to ensure physical consistency by treating input and edited images as video frames and using temporal reasoning with large pretrained video models.

DetailsMotivation: Current image editing methods lack physical consistency, which is crucial for world simulation tasks where edited objects must remain coherent and plausible.

Method: Treats input and edited images as first/last video frames, leverages pretrained video models for temporal consistency, introduces temporal reasoning stage with reasoning tokens to imagine plausible editing trajectories, then drops tokens to avoid full video rendering costs.

Result: Outperforms state-of-the-art baselines in both visual fidelity and physical plausibility on the new PBench-Edit benchmark for physically consistent image editing.

Conclusion: ChronoEdit effectively bridges the physical consistency gap in image editing by reformulating it as a video generation problem, with 14B and 2B model variants to be released.

Abstract: Recent advances in large generative models have significantly advanced image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit

[333] CARE-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson’s Disease Gait Assessment

Vida Adeli, Ivan Klabucar, Javad Rajabi, Benjamin Filtjens, Soroush Mehraban, Diwei Wang, Hyewon Seo, Trung-Hieu Hoang, Minh N. Do, Candice Muller, Claudia Oliveira, Daniel Boari Coelho, Pieter Ginis, Moran Gilat, Alice Nieuwboer, Joke Spildooren, Lucas Mckay, Hyeokhyen Kwon, Gari Clifford, Christine Esper, Stewart Factor, Imari Genias, Amirhossein Dadashzadeh, Leia Shum, Alan Whone, Majid Mirmehdi, Andrea Iaboni, Babak Taati

Main category: cs.CV

TL;DR: CARE-PD is the largest publicly available 3D mesh gait dataset for Parkinson’s Disease, spanning 9 cohorts from 8 clinical centers, supporting supervised clinical score prediction and unsupervised motion pretext tasks.

DetailsMotivation: Objective gait assessment in Parkinson's Disease is limited by the absence of large, diverse, and clinically annotated motion datasets.

Method: All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. The dataset supports supervised clinical score prediction (UPDRS gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction).

Result: Motion encoders consistently outperform handcrafted features. Pretraining on CARE-PD reduces MPJPE from 60.8mm to 7.5mm and boosts PD severity macro-F1 by 17 percentage points.

Conclusion: CARE-PD demonstrates the value of clinically curated, diverse training data for Parkinson’s Disease assessment and is released for non-commercial research.

Abstract: Objective gait assessment in Parkinson’s Disease (PD) is limited by the absence of large, diverse, and clinically annotated motion datasets. We introduce CARE-PD, the largest publicly available archive of 3D mesh gait data for PD, and the first multi-site collection spanning 9 cohorts from 8 clinical centers. All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. CARE-PD supports two key benchmarks: supervised clinical score prediction (estimating Unified Parkinson’s Disease Rating Scale, UPDRS, gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction). Clinical prediction is evaluated under four generalization protocols: within-dataset, cross-dataset, leave-one-dataset-out, and multi-dataset in-domain adaptation. To assess clinical relevance, we compare state-of-the-art motion encoders with a traditional gait-feature baseline, finding that encoders consistently outperform handcrafted features. Pretraining on CARE-PD reduces MPJPE (from 60.8mm to 7.5mm) and boosts PD severity macro-F1 by 17 percentage points, underscoring the value of clinically curated, diverse training data. CARE-PD and all benchmark code are released for non-commercial research at https://neurips2025.care-pd.ca/.

[334] GenAR: Next-Scale Autoregressive Generation for Spatial Gene Expression Prediction

Jiarui Ouyang, Yihui Wang, Yihang Gao, Yingxue Xu, Shu Yang, Hao Chen

Main category: cs.CV

TL;DR: GenAR is a multi-scale autoregressive framework that predicts spatial gene expression from H&E stained images by modeling gene dependencies hierarchically and generating discrete count tokens, achieving state-of-the-art performance.

DetailsMotivation: Spatial Transcriptomics is expensive, while H&E stained images are widely available. Existing methods predict genes independently and use continuous regression despite expression being discrete counts, leading to biologically implausible results.

Method: GenAR clusters genes into hierarchical groups to capture dependencies, models expression as discrete token generation for raw counts, and uses fused histological and spatial embeddings in a coarse-to-fine decoding process.

Result: Extensive experiments on four Spatial Transcriptomics datasets across different tissue types show GenAR achieves state-of-the-art performance.

Conclusion: GenAR offers a cost-effective alternative for spatial gene expression prediction with potential applications in precision medicine and molecular profiling.

Abstract: Spatial Transcriptomics (ST) offers spatially resolved gene expression but remains costly. Predicting expression directly from widely available Hematoxylin and Eosin (H&E) stained images presents a cost-effective alternative. However, most computational approaches (i) predict each gene independently, overlooking co-expression structure, and (ii) cast the task as continuous regression despite expression being discrete counts. This mismatch can yield biologically implausible outputs and complicate downstream analyses. We introduce GenAR, a multi-scale autoregressive framework that refines predictions from coarse to fine. GenAR clusters genes into hierarchical groups to expose cross-gene dependencies, models expression as codebook-free discrete token generation to directly predict raw counts, and conditions decoding on fused histological and spatial embeddings. From an information-theoretic perspective, the discrete formulation avoids log-induced biases and the coarse-to-fine factorization aligns with a principled conditional decomposition. Extensive experimental results on four Spatial Transcriptomics datasets across different tissue types demonstrate that GenAR achieves state-of-the-art performance, offering potential implications for precision medicine and cost-effective molecular profiling. Code is publicly available at https://github.com/oyjr/genar.

[335] RAP: 3D Rasterization Augmented End-to-End Planning

Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, Alexandre Alahi

Main category: cs.CV

TL;DR: RAP (Rasterization Augmented Planning) uses lightweight 3D rasterization instead of photorealistic rendering for end-to-end driving training, enabling scalable data augmentation with counterfactual recovery maneuvers and cross-agent view synthesis, achieving state-of-the-art performance on major benchmarks.

DetailsMotivation: Imitation learning for end-to-end driving lacks recovery data - small mistakes compound into failures. Photorealistic rendering methods are too slow and costly for training. The key insight is that photorealism is unnecessary; semantic fidelity (geometry and dynamics) matters more than textures or lighting.

Method: Proposes 3D Rasterization to replace costly rendering with lightweight rasterization of annotated primitives. Enables augmentations like counterfactual recovery maneuvers and cross-agent view synthesis. Introduces Raster-to-Real feature-space alignment to bridge sim-to-real gap.

Result: Achieves state-of-the-art closed-loop robustness and long-tail generalization. Ranks first on four major benchmarks: NAVSIM v1/v2, Waymo Open Dataset Vision-based E2E Driving, and Bench2Drive.

Conclusion: Lightweight rasterization with feature alignment suffices to scale E2E training, offering a practical alternative to photorealistic rendering for end-to-end driving planners.

Abstract: Imitation learning for end-to-end driving trains policies only on expert demonstrations. Once deployed in a closed loop, such policies lack recovery data: small mistakes cannot be corrected and quickly compound into failures. A promising direction is to generate alternative viewpoints and trajectories beyond the logged path. Prior work explores photorealistic digital twins via neural rendering or game engines, but these methods are prohibitively slow and costly, and thus mainly used for evaluation. In this work, we argue that photorealism is unnecessary for training end-to-end planners. What matters is semantic fidelity and scalability: driving depends on geometry and dynamics, not textures or lighting. Motivated by this, we propose 3D Rasterization, which replaces costly rendering with lightweight rasterization of annotated primitives, enabling augmentations such as counterfactual recovery maneuvers and cross-agent view synthesis. To transfer these synthetic views effectively to real-world deployment, we introduce a Raster-to-Real feature-space alignment that bridges the sim-to-real gap. Together, these components form Rasterization Augmented Planning (RAP), a scalable data augmentation pipeline for planning. RAP achieves state-of-the-art closed-loop robustness and long-tail generalization, ranking first on four major benchmarks: NAVSIM v1/v2, Waymo Open Dataset Vision-based E2E Driving, and Bench2Drive. Our results show that lightweight rasterization with feature alignment suffices to scale E2E training, offering a practical alternative to photorealistic rendering. Project page: https://alan-lanfeng.github.io/RAP/.

[336] Diffusion^2: Dual Diffusion Model with Uncertainty-Aware Adaptive Noise for Momentary Trajectory Prediction

Yuhao Luo, Yuang Zhang, Kehua Chen, Xinyu Zheng, Shucheng Zhang, Sikai Chen, Yinhai Wang

Main category: cs.CV

TL;DR: Diffusion^2 is a novel framework for momentary pedestrian trajectory prediction using two sequential diffusion models - one for backward historical trajectory generation and another for forward future prediction, with uncertainty estimation and adaptive noise modulation.

DetailsMotivation: Real-world scenarios often lack sufficient observational data (e.g., pedestrians emerging from blind spots), making accurate trajectory prediction challenging and increasing traffic accident risks. Current methods struggle with momentary trajectory prediction.

Method: Two sequentially connected diffusion models: backward diffusion for generating unobserved historical trajectories, and forward diffusion for future trajectory prediction. Includes dual-head parameterization for aleatoric uncertainty estimation and temporally adaptive noise module for dynamic noise scale modulation.

Result: Sets new state-of-the-art performance in momentary trajectory prediction on ETH/UCY and Stanford Drone datasets.

Conclusion: The proposed Diffusion^2 framework effectively addresses momentary trajectory prediction challenges and enhances traffic safety in autonomous driving scenarios.

Abstract: Accurate pedestrian trajectory prediction is crucial for ensuring safety and efficiency in autonomous driving and human-robot interaction scenarios. Earlier studies primarily utilized sufficient observational data to predict future trajectories. However, in real-world scenarios, such as pedestrians suddenly emerging from blind spots, sufficient observational data is often unavailable (i.e. momentary trajectory), making accurate prediction challenging and increasing the risk of traffic accidents. Therefore, advancing research on pedestrian trajectory prediction under extreme scenarios is critical for enhancing traffic safety. In this work, we propose a novel framework termed Diffusion^2, tailored for momentary trajectory prediction. Diffusion^2 consists of two sequentially connected diffusion models: one for backward prediction, which generates unobserved historical trajectories, and the other for forward prediction, which forecasts future trajectories. Given that the generated unobserved historical trajectories may introduce additional noise, we propose a dual-head parameterization mechanism to estimate their aleatoric uncertainty and design a temporally adaptive noise module that dynamically modulates the noise scale in the forward diffusion process. Empirically, Diffusion^2 sets a new state-of-the-art in momentary trajectory prediction on ETH/UCY and Stanford Drone datasets.

[337] MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator

Xuehai He, Shijie Zhou, Thivyanth Venkateswaran, Kaizhi Zheng, Ziyu Wan, Achuta Kadambi, Xin Eric Wang

Main category: cs.CV

TL;DR: MorphoSim is a language-guided framework that generates 4D scenes with multi-view consistency and object-level controls, enabling interactive editing without full re-generation.

DetailsMotivation: World models with controllable and editable spatiotemporal environments are valuable for robotics for scalable training data, reproducible evaluation, and flexible task design. Current text-to-video models are limited to 2D views and offer limited interaction.

Method: Integrates trajectory-guided generation with feature field distillation, allowing edits to be applied interactively without full re-generation. Generates dynamic environments from natural language instructions with object-level controls.

Result: MorphoSim maintains high scene fidelity while enabling controllability and editability. Objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints.

Conclusion: MorphoSim provides a framework for generating 4D scenes with multi-view consistency and object-level controls, addressing limitations of current text-to-video models and offering practical applications for robotics.

Abstract: World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, repro ducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynam ics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field dis tillation, allowing edits to be applied interactively without full re-generation. Experiments show that Mor phoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.

[338] Your Vision-Language Model Can’t Even Count to 20: Exposing the Failures of VLMs in Compositional Counting

Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, Jiahao Zhang

Main category: cs.CV

TL;DR: VLMs struggle with compositional counting tasks involving multiple shape types, despite performing well on single-shape counting, revealing a fundamental limitation in current models.

DetailsMotivation: To investigate whether Vision-Language Models (VLMs) can correctly count objects, particularly in compositional scenarios with multiple shape types, which remains an open question despite their impressive performance on other vision-language tasks.

Method: Created VLMCountBench - a minimalist benchmark using basic geometric shapes (triangles, circles) and their compositions, with strict variable control to study effects of color, size, and prompt refinement through controlled ablation studies.

Result: VLMs can count reliably when only one shape type is present, but exhibit substantial failures when multiple shape types are combined (compositional counting), showing systematic counting errors in complex scenarios.

Conclusion: Current VLMs have a fundamental empirical limitation in compositional counting, which motivates important future research directions to address this weakness in vision-language understanding.

Abstract: Vision-Language Models (VLMs) have become a central focus of today’s AI community, owing to their impressive abilities gained from training on large-scale vision-language data from the Web. These models have demonstrated strong performance across diverse tasks, including image understanding, video understanding, complex visual reasoning, and embodied AI. Despite these noteworthy successes, a fundamental question remains: Can VLMs count objects correctly? In this paper, we introduce a simple yet effective benchmark, VLMCountBench, designed under a minimalist setting with only basic geometric shapes (e.g., triangles, circles) and their compositions, focusing exclusively on counting tasks without interference from other factors. We adopt strict independent variable control and systematically study the effects of simple properties such as color, size, and prompt refinement in a controlled ablation. Our empirical results reveal that while VLMs can count reliably when only one shape type is present, they exhibit substantial failures when multiple shape types are combined (i.e., compositional counting). This highlights a fundamental empirical limitation of current VLMs and motivates important directions for future research.

[339] CodeFormer++: Blind Face Restoration Using Deformable Registration and Deep Metric Learning

Venkata Bharath Reddy Reddem, Akshay P Sarashetti, Ranjith Merugu, Amit Satish Unde

Main category: cs.CV

TL;DR: CodeFormer++ is a novel blind face restoration framework that addresses the trade-off between visual quality and identity preservation by decomposing the problem into three sub-tasks and using deformable registration, texture guidance, and deep metric learning.

DetailsMotivation: Existing blind face restoration methods struggle with balancing visual quality and identity fidelity, often resulting in either identity distortion or poor degradation removal.

Method: Decomposes BFR into three sub-tasks: identity-preserving restoration, high-quality generation, and dynamic fusion. Uses learning-based deformable face registration, texture-guided restoration network, and deep metric learning with informative positive/negative samples.

Result: Extensive experiments on real-world and synthetic datasets show superior performance in both visual fidelity and identity consistency compared to existing methods.

Conclusion: CodeFormer++ effectively maximizes generative priors for high-quality face restoration while preserving identity through its three-component approach.

Abstract: Blind face restoration (BFR) has attracted increasing attention with the rise of generative methods. Most existing approaches integrate generative priors into the restoration pro- cess, aiming to jointly address facial detail generation and identity preservation. However, these methods often suffer from a trade-off between visual quality and identity fidelity, leading to either identity distortion or suboptimal degradation removal. In this paper, we present CodeFormer++, a novel framework that maximizes the utility of generative priors for high-quality face restoration while preserving identity. We decompose BFR into three sub-tasks: (i) identity- preserving face restoration, (ii) high-quality face generation, and (iii) dynamic fusion of identity features with realistic texture details. Our method makes three key contributions: (1) a learning-based deformable face registration module that semantically aligns generated and restored faces; (2) a texture guided restoration network to dynamically extract and transfer the texture of generated face to boost the quality of identity-preserving restored face; and (3) the integration of deep metric learning for BFR with the generation of informative positive and hard negative samples to better fuse identity- preserving and generative features. Extensive experiments on real-world and synthetic datasets demonstrate that, the pro- posed CodeFormer++ achieves superior performance in terms of both visual fidelity and identity consistency.

[340] A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering

Yuanhao Zou, Shengji Jin, Andong Deng, Youpeng Zhao, Jun Wang, Chen Chen

Main category: cs.CV

TL;DR: A.I.R. is a training-free frame selection method for VideoQA that uses a powerful VLM for deep semantic analysis of queries and processes frames iteratively in small batches to balance accuracy and computational efficiency.

DetailsMotivation: Current frame selection methods face a trade-off: lightweight models like CLIP fail to capture query nuances, while VLM-based methods are computationally prohibitive for processing entire videos.

Method: A.I.R. uses a powerful VLM for deep semantic analysis of complex queries and deploys this analysis in a cost-effective iterative loop that processes only small batches of high-potential frames at a time.

Result: Extensive experiments show A.I.R. outperforms existing frame selection methods, significantly boosts foundation VLM performance, and achieves substantial computational efficiency gains over other VLM-based techniques.

Conclusion: The proposed A.I.R. approach effectively addresses the accuracy-efficiency trade-off in frame selection for VideoQA by combining deep semantic analysis with iterative processing of high-potential frames.

Abstract: Effectively applying Vision-Language Models (VLMs) to Video Question Answering (VideoQA) hinges on selecting a concise yet comprehensive set of frames, as processing entire videos is computationally infeasible. However, current frame selection methods face a critical trade-off: approaches relying on lightweight similarity models, such as CLIP, often fail to capture the nuances of complex queries, resulting in inaccurate similarity scores that cannot reflect the authentic query-frame relevance, which further undermines frame selection. Meanwhile, methods that leverage a VLM for deeper analysis achieve higher accuracy but incur prohibitive computational costs. To address these limitations, we propose A.I.R., a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. We leverage a powerful VLM to perform deep, semantic analysis on complex queries, and this analysis is deployed within a cost-effective iterative loop that processes only a small batch of the most high-potential frames at a time. Extensive experiments on various VideoQA benchmarks demonstrate that our approach outperforms existing frame selection methods, significantly boosts the performance of the foundation VLM, and achieves substantial gains in computational efficiency over other VLM-based techniques.

[341] REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

Qiyuan He, Yicong Li, Haotian Ye, Jinghao Wang, Xinyao Liao, Pheng-Ann Heng, Stefano Ermon, James Zou, Angela Yao

Main category: cs.CV

TL;DR: reAR is a simple training strategy that improves visual autoregressive generation by addressing generator-tokenizer inconsistency through token-wise regularization, achieving performance comparable to larger diffusion models.

DetailsMotivation: Visual autoregressive generation underperforms compared to diffusion models, with prior work attributing this to tokenizer limitations and rasterization ordering. The core issue identified is generator-tokenizer inconsistency - AR-generated tokens may not be well-decoded by the tokenizer.

Method: Proposes reAR training strategy with token-wise regularization: when predicting next token, the causal transformer also recovers visual embedding of current token and predicts embedding of target token under noisy context. No changes needed to tokenizer, generation order, inference pipeline, or external models.

Result: Substantially improves performance: On ImageNet, reduces gFID from 3.02 to 1.86 and improves IS to 316.9 with standard rasterization tokenizer. With advanced tokenizers, achieves gFID of 1.42 with only 177M parameters, matching performance of larger state-of-the-art diffusion models (675M).

Conclusion: reAR effectively addresses generator-tokenizer inconsistency in visual autoregressive generation through simple regularization, achieving competitive performance with diffusion models while maintaining model simplicity and requiring no architectural changes.

Abstract: Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal against diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance with larger state-of-the-art diffusion models (675M).

[342] MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models

Soo Yong Kim, Suin Cho, Vincent-Daniel Yun, Gyeongyeon Hwang

Main category: cs.CV

TL;DR: MedCLM is an automated pipeline that converts detection datasets into medical VQA data with Chain-of-Thought reasoning, achieving state-of-the-art performance through a progressive curriculum strategy.

DetailsMotivation: Bridging clinical diagnostic reasoning with AI remains a central challenge in medical imaging, requiring systems that can provide step-by-step reasoning similar to human clinicians.

Method: Automated pipeline converts detection datasets into medical VQA data with CoT reasoning by linking lesion boxes to organ segmentation and structured rationales. Uses Integrated CoT-Curriculum Strategy with Easy (explicit boxes), Medium (implicit localization), and Hard (weakly supervised) stages.

Result: MedCLM attains state-of-the-art performance on several medical VQA benchmarks, demonstrating superior reasoning capabilities.

Conclusion: Provides a scalable framework for developing clinically aligned medical vision-language models that can generate question-answer pairs with step-by-step reasoning.

Abstract: Bridging clinical diagnostic reasoning with AI remains a central challenge in medical imaging. We introduce MedCLM, an automated pipeline that converts detection datasets into large-scale medical visual question answering (VQA) data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ segmentation and structured rationales. These contextual signals enable medical vision-language models to generate question-answer pairs with step-by-step reasoning. To utilize this data effectively, we propose an Integrated CoT-Curriculum Strategy composed of an Easy stage with explicit lesion boxes for visual grounding, a Medium stage that encourages implicit localization, and a Hard stage for weakly supervised reasoning. Experimental results demonstrate that MedCLM attains state-of-the-art performance on several medical VQA benchmarks, providing a scalable framework for developing clinically aligned medical vision-language models.

[343] VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery

Nonghai Zhang, Zeyu Zhang, Jiazi Wang, Yang Zhao, Hao Tang

Main category: cs.CV

TL;DR: The paper introduces VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery analysis, and proposes VaseVLM model to address data scarcity and domain knowledge limitations in cultural heritage applications of Vision-Language Models.

DetailsMotivation: Existing VLMs struggle with specialized cultural heritage domains like 3D vase artifacts due to data scarcity and insufficient domain knowledge, limiting their effectiveness in culturally significant tasks.

Method: Created VaseVQA-3D dataset with 664 ancient Greek vase 3D models and corresponding QA data, established a complete data construction pipeline, and developed VaseVLM model with domain-adaptive training.

Result: Improved by 12.8% on R@1 metrics and 6.6% on lexical similarity compared to previous state-of-the-art on the VaseVQA-3D dataset, significantly enhancing 3D vase artifact recognition and understanding.

Conclusion: The approach provides new technical pathways for digital heritage preservation research by effectively addressing domain-specific challenges in cultural heritage applications of VLMs.

Abstract: Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding tasks, demonstrating strong capabilities particularly in general tasks such as image captioning and visual reasoning. However, when dealing with specialized cultural heritage domains like 3D vase artifacts, existing models face severe data scarcity issues and insufficient domain knowledge limitations. Due to the lack of targeted training data, current VLMs struggle to effectively handle such culturally significant specialized tasks. To address these challenges, we propose the VaseVQA-3D dataset, which serves as the first 3D visual question answering dataset for ancient Greek pottery analysis, collecting 664 ancient Greek vase 3D models with corresponding question-answer data and establishing a complete data construction pipeline. We further develop the VaseVLM model, enhancing model performance in vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach, where we improve by 12.8% on R@1 metrics and by 6.6% on lexical similarity compared with previous state-of-the-art on the VaseVQA-3D dataset, significantly improving the recognition and understanding of 3D vase artifacts, providing new technical pathways for digital heritage preservation research.

[344] TBStar-Edit: From Image Editing Pattern Shifting to Consistency Enhancement

Hao Fang, Zechao Zhan, Weixin Feng, Ziwei Huang, XuBin Li, Tiezheng Ge

Main category: cs.CV

TL;DR: TBStar-Edit is a specialized image editing model for e-commerce that addresses consistency limitations of general models through hierarchical architecture and two-stage training.

DetailsMotivation: General image editing models perform poorly in e-commerce scenarios due to consistency limitations in maintaining product appearance and layout integrity.

Method: Three-part approach: 1) Comprehensive data pipeline for high-quality editing data, 2) Hierarchical model with base model, pattern shifting modules, and consistency enhancement modules, 3) Two-stage training strategy (pattern shifting then consistency enhancement) with separate datasets.

Result: TBStar-Edit outperforms existing general-domain editing models in both objective metrics (VIE Score) and subjective user preference on e-commerce benchmarks.

Conclusion: The proposed specialized approach with data engineering, hierarchical architecture, and staged training effectively addresses e-commerce image editing consistency challenges.

Abstract: Recent advances in image generation and editing technologies have enabled state-of-the-art models to achieve impressive results in general domains. However, when applied to e-commerce scenarios, these general models often encounter consistency limitations. To address this challenge, we introduce TBStar-Edit, an new image editing model tailored for the e-commerce domain. Through rigorous data engineering, model architecture design and training strategy, TBStar-Edit achieves precise and high-fidelity image editing while maintaining the integrity of product appearance and layout. Specifically, for data engineering, we establish a comprehensive data construction pipeline, encompassing data collection, construction, filtering, and augmentation, to acquire high-quality, instruction-following, and strongly consistent editing data to support model training. For model architecture design, we design a hierarchical model framework consisting of a base model, pattern shifting modules, and consistency enhancement modules. For model training, we adopt a two-stage training strategy to enhance the consistency preservation: first stage for editing pattern shifting, and second stage for consistency enhancement. Each stage involves training different modules with separate datasets. Finally, we conduct extensive evaluations of TBStar-Edit on a self-proposed e-commerce benchmark, and the results demonstrate that TBStar-Edit outperforms existing general-domain editing models in both objective metrics (VIE Score) and subjective user preference.

[345] Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

Zijing Hu, Yunze Tong, Fengda Zhang, Junkun Yuan, Jun Xiao, Kun Kuang

Main category: cs.CV

TL;DR: Asynchronous diffusion models improve text-to-image alignment by assigning different timesteps to different pixels, allowing prompt-related regions to reference clearer context from less noisy areas.

DetailsMotivation: Standard diffusion models struggle with faithful text-to-image alignment due to synchronous denoising, where all pixels evolve simultaneously from noise, preventing prompt-related regions from obtaining clear context.

Method: Proposed asynchronous diffusion framework that allocates distinct timesteps to different pixels and dynamically modulates pixel-wise denoising schedules, allowing prompt-related regions to denoise more gradually while leveraging clearer context from less noisy areas.

Result: Extensive experiments show significant improvement in text-to-image alignment across diverse prompts compared to standard synchronous diffusion models.

Conclusion: Asynchronous diffusion models effectively address the alignment limitations of traditional diffusion models by enabling better context utilization through staggered pixel denoising schedules.

Abstract: Diffusion models have achieved impressive results in generating high-quality images. Yet, they often struggle to faithfully align the generated images with the input prompts. This limitation arises from synchronous denoising, where all pixels simultaneously evolve from random noise to clear images. As a result, during generation, the prompt-related regions can only reference the unrelated regions at the same noise level, failing to obtain clear context and ultimately impairing text-to-image alignment. To address this issue, we propose asynchronous diffusion models – a novel framework that allocates distinct timesteps to different pixels and reformulates the pixel-wise denoising process. By dynamically modulating the timestep schedules of individual pixels, prompt-related regions are denoised more gradually than unrelated regions, thereby allowing them to leverage clearer inter-pixel context. Consequently, these prompt-related regions achieve better alignment in the final images. Extensive experiments demonstrate that our asynchronous diffusion models can significantly improve text-to-image alignment across diverse prompts. The code repository for this work is available at https://github.com/hu-zijing/AsynDM.

[346] TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling

Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, Kyong Hwan Jin

Main category: cs.CV

TL;DR: TAG is an efficient guidance method that amplifies tangential components of diffusion model scores to improve semantic consistency without model modifications.

DetailsMotivation: Current diffusion models suffer from semantic inconsistencies and hallucinations, and existing guidance methods introduce computational overhead through external signals or architectural changes.

Method: TAG uses an intermediate sample as projection basis and amplifies tangential components of estimated scores via first-order Taylor expansion to correct sampling trajectory.

Result: The method steers sampling toward higher-probability regions, reducing inconsistencies and enhancing sample quality with minimal computational addition.

Conclusion: TAG provides a plug-and-play, architecture-agnostic guidance approach that improves diffusion sampling fidelity efficiently.

Abstract: Recent diffusion models achieve the state-of-the-art performance in image generation, but often suffer from semantic inconsistencies or hallucinations. While various inference-time guidance methods can enhance generation, they often operate indirectly by relying on external signals or architectural modifications, which introduces additional computational overhead. In this paper, we propose Tangential Amplifying Guidance (TAG), a more efficient and direct guidance method that operates solely on trajectory signals without modifying the underlying diffusion model. TAG leverages an intermediate sample as a projection basis and amplifies the tangential components of the estimated scores with respect to this basis to correct the sampling trajectory. We formalize this guidance process by leveraging a first-order Taylor expansion, which demonstrates that amplifying the tangential component steers the state toward higher-probability regions, thereby reducing inconsistencies and enhancing sample quality. TAG is a plug-and-play, architecture-agnostic module that improves diffusion sampling fidelity with minimal computational addition, offering a new perspective on diffusion guidance.

[347] Conditional Representation Learning for Customized Tasks

Honglin Liu, Chao Sun, Peng Hu, Yunfan Li, Xi Peng

Main category: cs.CV

TL;DR: CRL learns conditional representations tailored to user-specified criteria by constructing semantic bases from LLM-generated descriptive texts and projecting image representations into these customized feature spaces using VLMs.

DetailsMotivation: Universal representations often prioritize dominant semantics that may not align with specific downstream tasks, while supervised fine-tuning is computationally expensive and requires extensive annotation.

Method: Uses LLM to generate descriptive texts for user criteria to construct semantic basis, then projects image representations into this conditional feature space using VLM.

Result: Extensive experiments on classification and retrieval tasks demonstrate superiority and generality of CRL over conventional approaches.

Conclusion: CRL effectively extracts task-specific representations without costly fine-tuning, enabling better performance on customized downstream tasks.

Abstract: Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which however incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), aiming to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, thereby enabling a set of descriptive words to approximate the basis for a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space leveraging a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion, which could be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https://github.com/XLearning-SCU/2025-NeurIPS-CRL.

[348] Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior

Sheng Wang, Ruiming Wu, Charles Herndon, Yihang Liu, Shunsuke Koga, Jeanne Shen, Zhi Huang

Main category: cs.CV

TL;DR: The paper introduces AI Session Recorder to capture pathologists’ viewing behavior from WSI viewers, creates Pathology-CoT dataset for behavior supervision, and builds Pathologist-o3 agent that outperforms state-of-the-art in lymph-node metastasis detection.

DetailsMotivation: Current pathology foundation models lack practical agentic systems that can decide viewing fields, adjust magnification, and provide explainable diagnoses. The key blocker is the absence of scalable supervision for expert viewing behavior that is tacit and experience-based.

Method: Developed AI Session Recorder to unobtrusively record pathologists’ navigation in standard WSI viewers, converted logs into behavioral commands, created Pathology-CoT dataset through human-in-the-loop review, and built two-stage Pathologist-o3 agent with behavior-guided reasoning.

Result: Achieved 84.5% precision, 100.0% recall, and 75.4% accuracy on gastrointestinal lymph-node metastasis detection, exceeding OpenAI o3 model and generalizing across backbones.

Conclusion: The framework makes agentic pathology practical by converting everyday viewer logs into scalable, expert-validated supervision, establishing a path to human-aligned, upgradeable clinical AI.

Abstract: Diagnosing a whole-slide image is an interactive, multi-stage process involving changes in magnification and movement between fields. Although recent pathology foundation models are strong, practical agentic systems that decide what field to examine next, adjust magnification, and deliver explainable diagnoses are still lacking. The blocker is data: scalable, clinically aligned supervision of expert viewing behavior that is tacit and experience-based, not written in textbooks or online, and therefore absent from large language model training. We introduce the AI Session Recorder, which works with standard WSI viewers to unobtrusively record routine navigation and convert the viewer logs into standardized behavioral commands (inspect or peek at discrete magnifications) and bounding boxes. A lightweight human-in-the-loop review turns AI-drafted rationales into the Pathology-CoT dataset, a form of paired “where to look” and “why it matters” supervision produced at roughly six times lower labeling time. Using this behavioral data, we build Pathologist-o3, a two-stage agent that first proposes regions of interest and then performs behavior-guided reasoning. On gastrointestinal lymph-node metastasis detection, it achieved 84.5% precision, 100.0% recall, and 75.4% accuracy, exceeding the state-of-the-art OpenAI o3 model and generalizing across backbones. To our knowledge, this constitutes one of the first behavior-grounded agentic systems in pathology. Turning everyday viewer logs into scalable, expert-validated supervision, our framework makes agentic pathology practical and establishes a path to human-aligned, upgradeable clinical AI.

[349] A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification

Hao Liu, Yunhao Gao, Wei Li, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone

Main category: cs.CV

TL;DR: S^2Fin is a spatial-spectral-frequency interaction network that integrates frequency domain learning with multimodal remote sensing image classification, using sparse attention and frequency fusion strategies to improve feature extraction from heterogeneous data.

DetailsMotivation: Existing feature fusion techniques struggle to extract structural and detail features from heterogeneous and redundant multimodal remote sensing images, motivating the introduction of frequency domain learning.

Method: Proposes high-frequency sparse enhancement transformer with sparse spatial-spectral attention, two-level spatial-frequency fusion strategy (adaptive frequency channel module and high-frequency resonance mask), and spatial-spectral attention fusion module.

Result: Experiments on four benchmark multimodal datasets with limited labeled data demonstrate superior classification performance, outperforming state-of-the-art methods.

Conclusion: S^2Fin effectively integrates frequency domain learning to model key and sparse detail features, achieving better performance in multimodal remote sensing image classification.

Abstract: Deep learning-based methods have achieved significant success in remote sensing Earth observation data analysis. Numerous feature fusion techniques address multimodal remote sensing image classification by integrating global and local features. However, these techniques often struggle to extract structural and detail features from heterogeneous and redundant multimodal images. With the goal of introducing frequency domain learning to model key and sparse detail features, this paper introduces the spatial-spectral-frequency interaction network (S$^2$Fin), which integrates pairwise fusion modules across the spatial, spectral, and frequency domains. Specifically, we propose a high-frequency sparse enhancement transformer that employs sparse spatial-spectral attention to optimize the parameters of the high-frequency filter. Subsequently, a two-level spatial-frequency fusion strategy is introduced, comprising an adaptive frequency channel module that fuses low-frequency structures with enhanced high-frequency details, and a high-frequency resonance mask that emphasizes sharp edges via phase similarity. In addition, a spatial-spectral attention fusion module further enhances feature extraction at intermediate layers of the network. Experiments on four benchmark multimodal datasets with limited labeled data demonstrate that S$^2$Fin performs superior classification, outperforming state-of-the-art methods. The code is available at https://github.com/HaoLiu-XDU/SSFin.

[350] Do Superpixel Segmentation Methods Influence Deforestation Image Classification?

Hugo Resende, Fabio A. Faria, Eduardo B. Neto, Isabela Borlido, Victor Sundermann, Silvio Jamil F. Guimarães, Álvaro L. Fazenda

Main category: cs.CV

TL;DR: This study evaluates four superpixel segmentation methods alongside SLIC for deforestation detection, finding that while individual classifiers show similar performance, ensemble methods significantly improve accuracy.

DetailsMotivation: To improve deforestation detection in the ForestEyes project by evaluating whether alternative superpixel segmentation methods outperform the traditionally used SLIC algorithm for remote sensing image segmentation.

Method: Compared four top-performing superpixel segmentation methods with SLIC, trained classifiers using PyCaret AutoML library, and applied classifier fusion (ensemble) approaches.

Result: Initial results showed minimal performance variation among segmentation methods with individual classifiers, but ensemble methods demonstrated noticeable improvements in balanced accuracy.

Conclusion: Both segmentation method selection and classifier combination through ensemble approaches are crucial for optimizing deforestation detection performance in remote sensing applications.

Abstract: Image segmentation is a crucial step in various visual applications, including environmental monitoring through remote sensing. In the context of the ForestEyes project, which combines citizen science and machine learning to detect deforestation in tropical forests, image segments are used for labeling by volunteers and subsequent model training. Traditionally, the Simple Linear Iterative Clustering (SLIC) algorithm is adopted as the segmentation method. However, recent studies have indicated that other superpixel-based methods outperform SLIC in remote sensing image segmentation, and might suggest that they are more suitable for the task of detecting deforested areas. In this sense, this study investigated the impact of the four best segmentation methods, together with SLIC, on the training of classifiers for the target application. Initially, the results showed little variation in performance among segmentation methods, even when selecting the top five classifiers using the PyCaret AutoML library. However, by applying a classifier fusion approach (ensemble of classifiers), noticeable improvements in balanced accuracy were observed, highlighting the importance of both the choice of segmentation method and the combination of machine learning-based models for deforestation detection tasks.

[351] EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents

Buyuan Zhu, Shiyu Hu, Yiping Ma, Yuanming Zhang, Kang Hao Cheong

Main category: cs.CV

TL;DR: EduPersona is a large-scale benchmark for evaluating subjective abilities of LLMs in classroom settings, featuring multi-language, multi-subject coverage and persona-based evaluation across three progressive tasks.

DetailsMotivation: As LLMs are increasingly used in education for virtual student agents, there's a need to assess their classroom-oriented subjective abilities to understand model boundaries and enable trustworthy deployment.

Method: Created EduPersona benchmark with 1,308 authentic classroom dialogues expanded to 128k turns through persona stylization. Evaluates three tasks: basic coherence, student realism, and long-term persona consistency. Tests three LLMs and their persona-fine-tuned variants.

Result: Persona-fine-tuned models showed significant improvements: TASK1 +33.6%, TASK2 +30.6%, TASK3 +14.9%, demonstrating dataset effectiveness and revealing heterogeneous difficulty of persona modeling.

Conclusion: EduPersona provides the first classroom benchmark for subjective abilities, establishes a decoupled research paradigm, and will be open-sourced to advance trustworthy AI in education.

Abstract: As large language models are increasingly integrated into education, virtual student agents are becoming vital for classroom simulation and teacher training. Yet their classroom-oriented subjective abilities remain largely unassessed, limiting understanding of model boundaries and hindering trustworthy deployment. We present EduPersona, a large-scale benchmark spanning two languages, three subjects, and ten persona types based on the Big Five theory. The dataset contains 1,308 authentic classroom dialogue rounds, corresponding to 12,814 teacher-student Q&A turns, and is further expanded through persona stylization into roughly 10 times larger scale (128k turns), providing a solid foundation for evaluation. Building on this resource, we decompose hard-to-quantify subjective performance into three progressive tasks: TASK1 basic coherence (whether behavior, emotion, expression, and voice align with classroom context), TASK2 student realism, and TASK3 long-term persona consistency, thereby establishing an evaluation framework grounded in educational theory and research value. We conduct systematic experiments on three representative LLMs, comparing their original versions with ten persona-fine-tuned variants trained on EduPersona. Results show consistent and significant average improvements across all tasks: TASK1 +33.6%, TASK2 +30.6%, and TASK3 +14.9%. These improvements highlight the dataset’s effectiveness and research value, while also revealing the heterogeneous difficulty of persona modeling. In summary, EduPersona delivers the first classroom benchmark centered on subjective abilities, establishes a decoupled and verifiable research paradigm, and we will open-source both the dataset and the framework to support the broader research community in advancing trustworthy and human-like AI for education.

[352] MoME: Estimating Psychological Traits from Gait with Multi-Stage Mixture of Movement Experts

Andy Cǎtrunǎ, Adrian Cosma, Emilian Rǎdoi

Main category: cs.CV

TL;DR: A hierarchical Multi-Stage Mixture of Movement Experts (MoME) architecture is proposed for predicting psychological traits from gait sequences, achieving state-of-the-art performance on 17 psychological traits.

DetailsMotivation: Gait contains rich biometric and behavioral information, but using walking patterns to infer psychological traits remains challenging and underexplored.

Method: MoME processes walking cycles in four stages of movement complexity using lightweight expert models and task-specific gating modules to adaptively weight experts across traits and stages.

Result: Outperforms state-of-the-art gait analysis models with 37.47% weighted F1 score at run level and 44.6% at subject level. Integration of auxiliary tasks (identity, gender, BMI) further improves psychological trait estimation.

Conclusion: Demonstrates viability of multi-task gait-based learning for psychological trait estimation and provides foundation for movement-informed psychological inference research.

Abstract: Gait encodes rich biometric and behavioural information, yet leveraging the manner of walking to infer psychological traits remains a challenging and underexplored problem. We introduce a hierarchical Multi-Stage Mixture of Movement Experts (MoME) architecture for multi-task prediction of psychological attributes from gait sequences represented as 2D poses. MoME processes the walking cycle in four stages of movement complexity, employing lightweight expert models to extract spatio-temporal features and task-specific gating modules to adaptively weight experts across traits and stages. Evaluated on the PsyMo benchmark covering 17 psychological traits, our method outperforms state-of-the-art gait analysis models, achieving a 37.47% weighted F1 score at the run level and 44.6% at the subject level. Our experiments show that integrating auxiliary tasks such as identity recognition, gender prediction, and BMI estimation further improves psychological trait estimation. Our findings demonstrate the viability of multi-task gait-based learning for psychological trait estimation and provide a foundation for future research on movement-informed psychological inference.

[353] ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement

Habin Lim, Yeongseob Won, Juwon Seo, Gyeong-Moon Park

Main category: cs.CV

TL;DR: ConceptSplit is a framework that addresses concept mixing in multi-concept personalization for text-to-image diffusion models through Token-wise Value Adaptation (ToVA) during training and Latent Optimization for Disentangled Attention (LODA) during inference.

DetailsMotivation: To solve the problem of concept mixing where multiple learned concepts interfere or blend undesirably in output images when using text-to-image diffusion models for multi-concept personalization.

Method: Two key components: Token-wise Value Adaptation (ToVA) - a training method that adapts only the value projection in cross-attention instead of modifying key projection; and Latent Optimization for Disentangled Attention (LODA) - optimizes input latent during inference to alleviate attention entanglement.

Result: ConceptSplit achieves robust multi-concept personalization and mitigates unintended concept interference, as demonstrated through extensive qualitative and quantitative experiments.

Conclusion: The proposed framework effectively addresses concept mixing in multi-concept personalization by focusing on value projection adaptation during training and latent optimization during inference.

Abstract: In recent years, multi-concept personalization for text-to-image (T2I) diffusion models to represent several subjects in an image has gained much more attention. The main challenge of this task is “concept mixing”, where multiple learned concepts interfere or blend undesirably in the output image. To address this issue, in this paper, we present ConceptSplit, a novel framework to split the individual concepts through training and inference. Our framework comprises two key components. First, we introduce Token-wise Value Adaptation (ToVA), a merging-free training method that focuses exclusively on adapting the value projection in cross-attention. Based on our empirical analysis, we found that modifying the key projection, a common approach in existing methods, can disrupt the attention mechanism and lead to concept mixing. Second, we propose Latent Optimization for Disentangled Attention (LODA), which alleviates attention entanglement during inference by optimizing the input latent. Through extensive qualitative and quantitative experiments, we demonstrate that ConceptSplit achieves robust multi-concept personalization, mitigating unintended concept interference. Code is available at https://github.com/KU-VGI/ConceptSplit

[354] Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI

Quang-Khai Bui-Tran, Minh-Toan Dinh, Thanh-Huy Nguyen, Ba-Thinh Lam, Mai-Anh Vu, Ulas Bagci

Main category: cs.CV

TL;DR: A label-efficient liver segmentation method for multi-phase MRI that combines foundation model adaptation with co-training to handle limited labeled data, unlabeled sequences, and vendor/modal variations without spatial registration.

DetailsMotivation: Liver segmentation in multi-phase MRI is crucial for fibrosis assessment, but faces challenges with scarce labeled data, uneven distribution across modalities/vendors, spatial misalignment, and missing phases in real-world clinical settings.

Method: Integrates foundation-scale 3D segmentation backbone with fine-tuning, co-training using cross pseudo supervision to leverage unlabeled volumes, and standardized preprocessing pipeline without requiring spatial registration.

Result: The model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains.

Conclusion: The approach shows effectiveness for label-efficient liver segmentation in multi-phase, multi-vendor MRI and highlights the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.

Abstract: Accurate liver segmentation in multi-phase MRI is vital for liver fibrosis assessment, yet labeled data is often scarce and unevenly distributed across imaging modalities and vendor systems. We propose a label-efficient segmentation approach that promotes cross-modality generalization under real-world conditions, where GED4 hepatobiliary-phase annotations are limited, non-contrast sequences (T1WI, T2WI, DWI) are unlabeled, and spatial misalignment and missing phases are common. Our method integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline. Without requiring spatial registration, the model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains. Our results exhibit the effectiveness of our proposed label-efficient baseline for liver segmentation in multi-phase, multi-vendor MRI and highlight the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.

[355] ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion

Foivos Paraperas Papantoniou, Stefanos Zafeiriou

Main category: cs.CV

TL;DR: A diffusion-based framework for identity-consistent facial expression generation with fine-grained control using FLAME blendshape parameters and reference-based editing.

DetailsMotivation: Human-centric generative models need both identity consistency and precise control over human performance, but current methods struggle with fine-grained expression control without compromising identity.

Method: Built on ID-consistent face foundation model with compositional design featuring expression cross-attention module guided by FLAME blendshape parameters, trained on diverse image/video data with expressive variation, plus pluggable Reference Adapter for real image editing.

Result: Model generalizes beyond basic emotions to subtle micro-expressions and expressive transitions, outperforming existing methods in tailored and identity-consistent expression generation.

Conclusion: The framework successfully achieves faithful reimagining of subjects under any facial expression while maintaining identity consistency and enabling precise control.

Abstract: Human-centric generative models designed for AI-driven storytelling must bring together two core capabilities: identity consistency and precise control over human performance. While recent diffusion-based approaches have made significant progress in maintaining facial identity, achieving fine-grained expression control without compromising identity remains challenging. In this work, we present a diffusion-based framework that faithfully reimagines any subject under any particular facial expression. Building on an ID-consistent face foundation model, we adopt a compositional design featuring an expression cross-attention module guided by FLAME blendshape parameters for explicit control. Trained on a diverse mixture of image and video data rich in expressive variation, our adapter generalizes beyond basic emotions to subtle micro-expressions and expressive transitions, overlooked by prior works. In addition, a pluggable Reference Adapter enables expression editing in real images by transferring the appearance from a reference frame during synthesis. Extensive quantitative and qualitative evaluations show that our model outperforms existing methods in tailored and identity-consistent expression generation. Code and models can be found at https://github.com/foivospar/Arc2Face.

[356] Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction

KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho

Main category: cs.CV

TL;DR: The paper proposes a contrastive pretraining approach for 3D semantic scene graph prediction that focuses on improving object feature quality, which significantly boosts both object classification and relationship prediction performance.

DetailsMotivation: Previous approaches for 3D semantic scene graph prediction rely too heavily on Graph Neural Networks without optimizing object and relationship feature representations, leading to insufficient discriminative capability.

Method: Designs a highly discriminative object feature encoder with contrastive pretraining that decouples object representation learning from scene graph prediction, and effectively combines geometric and semantic features for relationship prediction.

Result: Significantly outperforms previous state-of-the-art methods on the 3DSSG dataset across all evaluation metrics, with substantial performance improvements when plugging the pretrained encoder into existing frameworks.

Conclusion: Object feature quality is critical for 3D scene graph accuracy, and the proposed contrastive pretraining approach with enhanced feature integration effectively addresses previous limitations.

Abstract: 3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.

[357] Benchmark on Monocular Metric Depth Estimation in Wildlife Setting

Niccolò Niccoli, Lorenzo Seidenari, Ilaria Greco, Francesco Rovero

Main category: cs.CV

TL;DR: First benchmark for monocular metric depth estimation in wildlife monitoring, evaluating 4 state-of-the-art methods and a geometric baseline on camera trap images. Depth Anything V2 performs best with 0.454m MAE, while ZoeDepth shows significant degradation in outdoor environments.

DetailsMotivation: Camera traps are widely used for wildlife monitoring but lack depth information from monocular images. While MDE methods have advanced, their performance in natural wildlife environments hasn't been systematically evaluated.

Method: Evaluated four state-of-the-art MDE methods (Depth Anything V2, ML Depth Pro, ZoeDepth, Metric3D) and geometric baseline on 93 camera trap images with ground truth distances from calibrated ChARUCO patterns. Compared median vs mean depth extraction approaches.

Result: Depth Anything V2 achieved best performance: 0.454m MAE and 0.962 correlation. ZoeDepth performed worst with 3.087m MAE. Median-based depth extraction consistently outperformed mean-based approaches. ZoeDepth fastest (0.17s/image) but least accurate; Depth Anything V2 balanced accuracy and speed (0.22s/image).

Conclusion: Establishes performance baselines for wildlife applications and provides practical guidance for implementing depth estimation in conservation monitoring systems, with Depth Anything V2 recommended for optimal balance of accuracy and speed.

Abstract: Camera traps are widely used for wildlife monitoring, but extracting accurate distance measurements from monocular images remains challenging due to the lack of depth information. While monocular depth estimation (MDE) methods have advanced significantly, their performance in natural wildlife environments has not been systematically evaluated. This work introduces the first benchmark for monocular metric depth estimation in wildlife monitoring conditions. We evaluate four state-of-the-art MDE methods (Depth Anything V2, ML Depth Pro, ZoeDepth, and Metric3D) alongside a geometric baseline on 93 camera trap images with ground truth distances obtained using calibrated ChARUCO patterns. Our results demonstrate that Depth Anything V2 achieves the best overall performance with a mean absolute error of 0.454m and correlation of 0.962, while methods like ZoeDepth show significant degradation in outdoor natural environments (MAE: 3.087m). We find that median-based depth extraction consistently outperforms mean-based approaches across all deep learning methods. Additionally, we analyze computational efficiency, with ZoeDepth being fastest (0.17s per image) but least accurate, while Depth Anything V2 provides an optimal balance of accuracy and speed (0.22s per image). This benchmark establishes performance baselines for wildlife applications and provides practical guidance for implementing depth estimation in conservation monitoring systems.

[358] Anomaly-Aware YOLO: A Frugal yet Robust Approach to Infrared Small Target Detection

Alina Ciocarlan, Sylvie Le Hégarat-Mascle, Sidonie Lefebvre

Main category: cs.CV

TL;DR: AA-YOLO integrates statistical anomaly detection into YOLO’s detection head to improve infrared small target detection by treating targets as anomalies against background, achieving competitive performance with low false alarm rates and good robustness.

DetailsMotivation: Conventional object detectors struggle with infrared small target detection due to complex backgrounds and tiny target sizes, leading to high false alarm rates that limit practical deployment in defense applications.

Method: Proposed Anomaly-Aware YOLO (AA-YOLO) that integrates a statistical anomaly detection test into the YOLO detection head, treating small targets as unexpected patterns against the background to control false alarms.

Result: Achieves competitive performance on IRSTD benchmarks with remarkable robustness in limited training data, noise, and domain shift scenarios. Successfully applied across various YOLO backbones including lightweight models and instance segmentation YOLO.

Conclusion: AA-YOLO provides a generic, versatile solution for real-world IRSTD deployments with constrained resources, offering low false alarm rates and good performance across different scenarios and model architectures.

Abstract: Infrared Small Target Detection (IRSTD) is a challenging task in defense applications, where complex backgrounds and tiny target sizes often result in numerous false alarms using conventional object detectors. To overcome this limitation, we propose Anomaly-Aware YOLO (AA-YOLO), which integrates a statistical anomaly detection test into its detection head. By treating small targets as unexpected patterns against the background, AA-YOLO effectively controls the false alarm rate. Our approach not only achieves competitive performance on several IRSTD benchmarks, but also demonstrates remarkable robustness in scenarios with limited training data, noise, and domain shifts. Furthermore, since only the detection head is modified, our design is highly generic and has been successfully applied across various YOLO backbones, including lightweight models. It also provides promising results when integrated into an instance segmentation YOLO. This versatility makes AA-YOLO an attractive solution for real-world deployments where resources are constrained. The code will be publicly released.

[359] Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics

Masoumeh Chapariniya, Teodora Vukovic, Sarah Ebling, Volker Dellwo

Main category: cs.CV

TL;DR: Transformer-based two-stream framework for person identification using spatial and temporal keypoint features achieves 98.03% accuracy, showing spatial configurations are more discriminative than temporal dynamics.

DetailsMotivation: To investigate transformer architectures for person identification in natural face-to-face conversation scenarios, exploring how spatial postural configurations and temporal motion patterns contribute to identity recognition.

Method: Two-stream framework using 133 COCO WholeBody keypoints from CANDOR corpus: spatial transformer for body configurations and multi-scale temporal transformer for hierarchical motion modeling. Compared pre-trained vs from-scratch training, velocity features, and feature-level fusion.

Result: Spatial transformer achieved 95.74% accuracy, temporal transformer 93.90%. Feature-level fusion boosted performance to 98.03%. Domain-specific training outperformed transfer learning, and spatial features were more discriminative than temporal dynamics.

Conclusion: Transformers are effective for person identification in natural interactions, with spatial and temporal information being complementary. Provides foundation for future multimodal and cross-cultural studies.

Abstract: This paper investigates the performance of transformer-based architectures for person identification in natural, face-to-face conversation scenario. We implement and evaluate a two-stream framework that separately models spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints, extracted from a subset of the CANDOR conversational corpus. Our experiments compare pre-trained and from-scratch training, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling. Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics. The spatial transformer achieves 95.74% accuracy, while the multi-scale temporal transformer achieves 93.90%. Feature-level fusion pushes performance to 98.03%, confirming that postural and dynamic information are complementary. These findings highlight the potential of transformer architectures for person identification in natural interactions and provide insights for future multimodal and cross-cultural studies.

[360] Visual Representations inside the Language Model

Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, Ranjay Krishna

Main category: cs.CV

TL;DR: The paper analyzes why Multimodal Language Models (MLMs) struggle with perception-heavy tasks by examining visual key-value tokens in popular MLMs, revealing that visual information flow and artifacts in later layers reduce perception capabilities.

DetailsMotivation: To understand why MLMs perform poorly on perception-heavy tasks despite extensive interpretability work on VIT encoders and transformer activations, focusing on the under-studied role of visual key-value tokens.

Method: Analyzed visual information flow through language models in popular MLMs (LLaVA-OneVision, Qwen2.5-VL, Llama-3-LLaVA-NeXT), studied key-value token processing, tested zero-shot perception tasks, and experimented with text prefix additions to control visual information.

Result: Found that image value tokens contain sufficient information for perception tasks, but language model augmentation reduces visual information compared to original visual encoders. Later layer key tokens contain artifacts that harm perception. Adding text prefixes improves perception, and 33.3% of Art Style questions in BLINK benchmark show unused perception information in language models.

Conclusion: Visual key-value tokens play crucial roles in MLM perception capabilities, with artifacts in later layers reducing performance. Better control of visual information could significantly improve perception, suggesting new training directions for visual encoders and language model components.

Abstract: Despite interpretability work analyzing VIT encoders and transformer activations, we don’t yet understand why Multimodal Language Models (MLMs) struggle on perception-heavy tasks. We offer an under-studied perspective by examining how popular MLMs (LLaVA-OneVision, Qwen2.5-VL, and Llama-3-LLaVA-NeXT) process their visual key-value tokens. We first study the flow of visual information through the language model, finding that image value tokens encode sufficient information to perform several perception-heavy tasks zero-shot: segmentation, semantic correspondence, temporal correspondence, and referring expression detection. We find that while the language model does augment the visual information received from the projection of input visual encodings-which we reveal correlates with overall MLM perception capability-it contains less visual information on several tasks than the equivalent visual encoder (SigLIP) that has not undergone MLM finetuning. Further, we find that the visual information corresponding to input-agnostic image key tokens in later layers of language models contains artifacts which reduce perception capability of the overall MLM. Next, we discuss controlling visual information in the language model, showing that adding a text prefix to the image input improves perception capabilities of visual representations. Finally, we reveal that if language models were able to better control their visual information, their perception would significantly improve; e.g., in 33.3% of Art Style questions in the BLINK benchmark, perception information present in the language model is not surfaced to the output! Our findings reveal insights into the role of key-value tokens in multimodal systems, paving the way for deeper mechanistic interpretability of MLMs and suggesting new directions for training their visual encoder and language model components.

[361] Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction

Chi Yan, Dan Xu

Main category: cs.CV

TL;DR: PG-Occ is a Progressive Gaussian Transformer Framework for open-vocabulary 3D occupancy prediction that uses progressive online densification and anisotropy-aware sampling to achieve state-of-the-art performance.

DetailsMotivation: To address the trade-off in text-aligned scene modeling where sparse Gaussian representation struggles with small objects while dense representation has high computational costs.

Method: Uses progressive online densification to gradually enhance 3D Gaussian representation, and anisotropy-aware sampling with spatio-temporal fusion to adaptively assign receptive fields.

Result: Achieves state-of-the-art performance with 14.3% mIoU improvement over previous best method.

Conclusion: PG-Occ effectively balances computational efficiency and fine-grained scene detail capture for open-vocabulary 3D occupancy prediction.

Abstract: The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ

[362] Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning

Xiaomeng Fan, Yuchuan Mao, Zhi Gao, Yuwei Wu, Jin Chen, Yunde Jia

Main category: cs.CV

TL;DR: The paper proposes a novel open-vocabulary learning method that generates unseen-class data using a class-domain-wise pipeline guided by hierarchical semantic trees and domain information, enabling effective distribution estimation and improved generalization.

DetailsMotivation: Existing methods for open-vocabulary learning only use seen-class data to estimate distributions in open environments, but the absence of unseen classes makes the estimation error inherently unidentifiable. Learning beyond seen classes is crucial for bounding this error.

Method: The method consists of: 1) A class-domain-wise data generation pipeline that generates unseen-class data guided by hierarchical semantic trees and domain information from seen-class data, 2) A distribution alignment algorithm that estimates and maximizes posterior probability using the generated data.

Result: Extensive experiments on 11 datasets show the method outperforms baseline approaches by up to 14%, demonstrating its effectiveness and superiority in open-vocabulary learning.

Conclusion: The paper theoretically and empirically shows that generating unseen-class data enables effective distribution estimation in open environments, with the proposed method providing significant performance improvements over existing approaches.

Abstract: Open-vocabulary learning requires modeling the data distribution in open environments, which consists of both seen-class and unseen-class data. Existing methods estimate the distribution in open environments using seen-class data, where the absence of unseen classes makes the estimation error inherently unidentifiable. Intuitively, learning beyond the seen classes is crucial for distribution estimation to bound the estimation error. We theoretically demonstrate that the distribution can be effectively estimated by generating unseen-class data, through which the estimation error is upper-bounded. Building on this theoretical insight, we propose a novel open-vocabulary learning method, which generates unseen-class data for estimating the distribution in open environments. The method consists of a class-domain-wise data generation pipeline and a distribution alignment algorithm. The data generation pipeline generates unseen-class data under the guidance of a hierarchical semantic tree and domain information inferred from the seen-class data, facilitating accurate distribution estimation. With the generated data, the distribution alignment algorithm estimates and maximizes the posterior probability to enhance generalization in open-vocabulary learning. Extensive experiments on $11$ datasets demonstrate that our method outperforms baseline approaches by up to $14%$, highlighting its effectiveness and superiority.

[363] Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge

Max Kirchner, Hanna Hoffmann, Alexander C. Jenke, Oliver L. Saldanha, Kevin Pfeiffer, Weam Kanjo, Julia Alekseenko, Claas de Boer, Santhi Raj Kolamuri, Lorenzo Mazza, Nicolas Padoy, Sophia Bano, Annika Reinke, Lena Maier-Hein, Danail Stoyanov, Jakob N. Kather, Fiona R. Kolbinger, Sebastian Bodenstedt, Stefanie Speidel

Main category: cs.CV

TL;DR: The FedSurg challenge benchmarked federated learning for surgical video classification using the Appendix300 dataset, evaluating generalization to unseen centers and local adaptation through fine-tuning.

DetailsMotivation: To assess how well current FL methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data.

Method: Participants used foundation models with linear probing, metric learning with triplet loss, and FL aggregation schemes (FedAvg, FedMedian, FedSAM) to classify inflammation stages in appendicitis videos. Performance was evaluated using F1-score and Expected Cost.

Result: Generalization to unseen centers was limited, but all teams improved after fine-tuning in the adaptation task. ViViT-based submission achieved strongest performance. Challenges included sensitivity to class imbalance and hyperparameter tuning difficulties.

Conclusion: Establishes first FL benchmark for surgical video classification, highlighting trade-offs between local personalization and global robustness, and emphasizing importance of architecture choice, preprocessing, and loss design for future clinical AI development.

Abstract: Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.

[364] Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization

Javed Ahmad, Federico Dassiè, Selene Frascella, Gabriele Marchello, Ferdinando Cannella, Arianna Traviglia

Main category: cs.CV

TL;DR: Automated two-robot system for high-fidelity 3D scanning of cultural heritage artefacts, eliminating manual intervention through coordinated robotic manipulation and optimized trajectory planning.

DetailsMotivation: Conventional 3D scanning methods for cultural heritage preservation require specialized expertise and manual intervention, which limits efficiency and accessibility.

Method: Two-robot system combining scanner-equipped robot and tray-handling robot with coordinated motion planning, parameterized scanning space, optimized trajectory planning, and waypoint distribution for comprehensive coverage.

Result: Achieves significantly lower Chamfer Distance and higher F-score compared to baseline methods, offering superior geometric accuracy and improved digitization efficiency.

Conclusion: The automated system provides high-fidelity 3D scanning with reduced reliance on expert operators, making cultural heritage digitization more efficient and accessible.

Abstract: High-fidelity 3D scanning is essential for preserving cultural heritage artefacts, supporting documentation, analysis, and long-term conservation. However, conventional methods typically require specialized expertise and manual intervention to maintain optimal scanning conditions and coverage. We present an automated two-robot scanning system that eliminates the need for handheld or semi-automatic workflows by combining coordinated robotic manipulation with high-resolution 3D scanning. Our system parameterizes the scanning space into distinct regions, enabling coordinated motion planning between a scanner-equipped robot and a tray-handling robot. Optimized trajectory planning and waypoint distribution ensure comprehensive surface coverage, minimize occlusions, and balance reconstruction accuracy with system efficiency. Experimental results show that our approach achieves significantly lower Chamfer Distance and higher F-score compared to baseline methods, offering superior geometric accuracy, improved digitization efficiency, and reduced reliance on expert operators.

[365] A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Alon Kaya, Igal Bilik, Inna Stainvas

Main category: cs.CV

TL;DR: Vision-transformers (ViTs) outperform CNNs in large-data scenarios for geometric estimation tasks, but CNNs match ViTs in small-data settings due to their inductive bias and smaller capacity. ViTs show better cross-domain generalization.

DetailsMotivation: To compare the efficiency of ViTs and large-scale CNNs as backbone architectures for geometric estimation tasks (2D rigid transformations and fundamental matrix prediction) in various data size settings, particularly in low-data regimes.

Method: Systematic comparison of large-scale CNNs (ResNet, EfficientNet, CLIP-ResNet) with ViT-based foundation models (CLIP-ViT variants and DINO) across different data size settings including few-shot scenarios. Analysis of model refinement performance in both large and small downstream-data scenarios.

Result: ViTs outperform CNNs during refinement in large downstream-data scenarios, similar to training from scratch. In small data scenarios, CNNs match ViT performance due to their inductive bias and smaller capacity. ViTs exhibit stronger generalization in cross-domain evaluation.

Conclusion: Careful selection of model architectures for refinement is crucial. Future research should focus on developing hybrid architectures that balance local and global representations for geometric estimation tasks.

Abstract: Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as backbone architectures for geometric estimation tasks involving image deformations in low-data regimes remains an open question. This work considers two such tasks: 1) estimating 2D rigid transformations between pairs of images and 2) predicting the fundamental matrix for stereo image pairs, an important problem in various applications, such as autonomous mobility, robotics, and 3D scene reconstruction. Addressing this intriguing question, this work systematically compares large-scale CNNs (ResNet, EfficientNet, CLIP-ResNet) with ViT-based foundation models (CLIP-ViT variants and DINO) in various data size settings, including few-shot scenarios. These pretrained models are optimized for classification or contrastive learning, encouraging them to focus mostly on high-level semantics. The considered tasks require balancing local and global features differently, challenging the straightforward adoption of these models as the backbone. Empirical comparative analysis shows that, similar to training from scratch, ViTs outperform CNNs during refinement in large downstream-data scenarios. However, in small data scenarios, the inductive bias and smaller capacity of CNNs improve their performance, allowing them to match that of a ViT. Moreover, ViTs exhibit stronger generalization in cross-domain evaluation where the data distribution changes. These results emphasize the importance of carefully selecting model architectures for refinement, motivating future research towards hybrid architectures that balance local and global representations.

[366] DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing

Qi Li, Shuwen Qiu, Julien Han, Xingzi Xu, Mehmet Saygin Seyfioglu, Kee Kiat Koo, Karim Bouyarmane

Main category: cs.CV

TL;DR: DiT-VTON is a Virtual Try-On framework using Diffusion Transformers that achieves state-of-the-art performance with superior detail preservation and robustness, while extending capabilities to Virtual Try-All across diverse product categories.

DetailsMotivation: Address limitations in existing VTO models including poor fine-grained detail preservation, lack of robustness to real-world imagery, inefficient sampling, limited image editing capabilities, and poor generalization across product categories.

Method: Leverages Diffusion Transformer (DiT) adapted for image-conditioned VTO, exploring multiple configurations (in-context token concatenation, channel concatenation, ControlNet integration). Trained on expanded dataset with varied backgrounds, unstructured references, and non-garment categories.

Result: Surpasses state-of-the-art methods on VITON-HD with superior detail preservation and robustness. Outperforms models with VTA and image editing capabilities across thousands of product categories without additional condition encoders.

Conclusion: DiT-VTON successfully redefines VTO as Virtual Try-All, offering versatile capabilities for diverse product categories and advanced image editing functionalities while demonstrating the benefits of data scaling for adaptability.

Abstract: The rapid growth of e-commerce has intensified the demand for Virtual Try-On (VTO) technologies, enabling customers to realistically visualize products overlaid on their own images. Despite recent advances, existing VTO models face challenges with fine-grained detail preservation, robustness to real-world imagery, efficient sampling, image editing capabilities, and generalization across diverse product categories. In this paper, we present DiT-VTON, a novel VTO framework that leverages a Diffusion Transformer (DiT), renowned for its performance on text-conditioned image generation, adapted here for the image-conditioned VTO task. We systematically explore multiple DiT configurations, including in-context token concatenation, channel concatenation, and ControlNet integration, to determine the best setup for VTO image conditioning. To enhance robustness, we train the model on an expanded dataset encompassing varied backgrounds, unstructured references, and non-garment categories, demonstrating the benefits of data scaling for VTO adaptability. DiT-VTON also redefines the VTO task beyond garment try-on, offering a versatile Virtual Try-All (VTA) solution capable of handling a wide range of product categories and supporting advanced image editing functionalities such as pose preservation, localized editing, texture transfer, and object-level customization. Experimental results show that our model surpasses state-of-the-art methods on VITON-HD, achieving superior detail preservation and robustness without reliance on additional condition encoders. It also outperforms models with VTA and image editing capabilities on a diverse dataset spanning thousands of product categories.

[367] Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors

Han Zhang, Lalithkumar Seenivasan, Jose L. Porras, Roger D. Soberanis-Mukul, Hao Ding, Hongchao Shu, Benjamin D. Killeen, Ankita Ghosh, Lonny Yarmus, Masaru Ishii, Angela Christine Argento, Mathias Unberath

Main category: cs.CV

TL;DR: EgoSurg reconstructs egocentric visual perspectives of surgical team members from fixed-camera video using neural rendering and diffusion enhancement, enabling immersive surgical analysis without disrupting clinical workflow.

DetailsMotivation: Traditional surgical observation methods rely on fixed viewpoints or recollections, missing the actual visual perspectives that guide clinical decisions, which limits insights into surgical safety, training, and workflow optimization.

Method: Couples geometry-driven neural rendering with diffusion-based view enhancement to synthesize arbitrary egocentric viewpoints from wall-mounted fixed-camera video.

Result: Successfully reconstructs person-specific visual fields and arbitrary viewpoints with high visual quality and fidelity across multi-site surgical cases and controlled studies.

Conclusion: Transforms existing OR camera infrastructure into navigable dynamic 3D records, establishing a foundation for immersive surgical data science that enables surgical practice to be visualized and analyzed from every angle.

Abstract: Observing surgical practice has historically relied on fixed vantage points or recollections, leaving the egocentric visual perspectives that guide clinical decisions undocumented. Fixed-camera video can capture surgical workflows at the room-scale, but cannot reconstruct what each team member actually saw. Thus, these videos only provide limited insights into how decisions that affect surgical safety, training, and workflow optimization are made. Here we introduce EgoSurg, the first framework to reconstruct the dynamic, egocentric replays for any operating room (OR) staff directly from wall-mounted fixed-camera video, and thus, without intervention to clinical workflow. EgoSurg couples geometry-driven neural rendering with diffusion-based view enhancement, enabling high-visual fidelity synthesis of arbitrary and egocentric viewpoints at any moment. In evaluation across multi-site surgical cases and controlled studies, EgoSurg reconstructs person-specific visual fields and arbitrary viewpoints with high visual quality and fidelity. By transforming existing OR camera infrastructure into a navigable dynamic 3D record, EgoSurg establishes a new foundation for immersive surgical data science, enabling surgical practice to be visualized, experienced, and analyzed from every angle.

[368] AvatarVTON: 4D Virtual Try-On for Animatable Avatars

Zicheng Jiang, Jixin Gao, Shengfeng He, Xinzhe Li, Yulong Zheng, Zhaotong Yang, Junyu Dong, Yong Du

Main category: cs.CV

TL;DR: AvatarVTON is the first 4D virtual try-on framework that generates realistic try-on results from a single garment image, supporting free pose control, novel-view rendering, and dynamic garment interactions without multi-view captures or physics priors.

DetailsMotivation: To overcome limitations of existing virtual try-on methods by enabling dynamic garment interactions and novel-view rendering from single-view supervision, without requiring multi-view garment captures or physics priors.

Method: Uses two key modules: (1) Reciprocal Flow Rectifier for optical-flow correction to stabilize avatar fitting and ensure temporal coherence, and (2) Non-Linear Deformer that decomposes Gaussian maps into view-pose-invariant and view-pose-specific components for adaptive garment deformations.

Result: Extensive experiments show AvatarVTON achieves high fidelity, diversity, and dynamic garment realism. The framework establishes a benchmark for 4D virtual try-on with fair qualitative and quantitative comparisons.

Conclusion: AvatarVTON is well-suited for AR/VR, gaming, and digital-human applications, demonstrating superior performance in generating realistic 4D virtual try-on experiences from single garment images.

Abstract: We propose AvatarVTON, the first 4D virtual try-on framework that generates realistic try-on results from a single in-shop garment image, enabling free pose control, novel-view rendering, and diverse garment choices. Unlike existing methods, AvatarVTON supports dynamic garment interactions under single-view supervision, without relying on multi-view garment captures or physics priors. The framework consists of two key modules: (1) a Reciprocal Flow Rectifier, a prior-free optical-flow correction strategy that stabilizes avatar fitting and ensures temporal coherence; and (2) a Non-Linear Deformer, which decomposes Gaussian maps into view-pose-invariant and view-pose-specific components, enabling adaptive, non-linear garment deformations. To establish a benchmark for 4D virtual try-on, we extend existing baselines with unified modules for fair qualitative and quantitative comparisons. Extensive experiments show that AvatarVTON achieves high fidelity, diversity, and dynamic garment realism, making it well-suited for AR/VR, gaming, and digital-human applications.

[369] Flow Matching for Conditional MRI-CT and CBCT-CT Image Synthesis

Arnela Hadzic, Simon Johannes Joham, Martin Urschler

Main category: cs.CV

TL;DR: The paper proposes a 3D Flow Matching framework for generating synthetic CT from MRI or CBCT to enable MRI-only and CBCT-based adaptive radiotherapy, improving treatment precision while reducing radiation exposure.

DetailsMotivation: To enable MRI-only and CBCT-based adaptive radiotherapy by generating synthetic CT images, which improves treatment precision and reduces patient radiation exposure.

Method: A fully 3D Flow Matching framework that transforms Gaussian noise into synthetic CT images by integrating a learned velocity field, conditioned on features from input MRI or CBCT using a lightweight 3D encoder. Separate models were trained for MRI→sCT and CBCT→sCT across abdomen, head and neck, and thorax regions.

Result: The method accurately reconstructs global anatomical structures but has limited preservation of fine details due to low training resolution constraints from memory and runtime limitations.

Conclusion: Future work will explore patch-based training and latent-space flow models to improve resolution and local structural fidelity for better synthetic CT generation.

Abstract: Generating synthetic CT (sCT) from MRI or CBCT plays a crucial role in enabling MRI-only and CBCT-based adaptive radiotherapy, improving treatment precision while reducing patient radiation exposure. To address this task, we adopt a fully 3D Flow Matching (FM) framework, motivated by recent work demonstrating FM’s efficiency in producing high-quality images. In our approach, a Gaussian noise volume is transformed into an sCT image by integrating a learned FM velocity field, conditioned on features extracted from the input MRI or CBCT using a lightweight 3D encoder. We evaluated the method on the SynthRAD2025 Challenge benchmark, training separate models for MRI $\rightarrow$ sCT and CBCT $\rightarrow$ sCT across three anatomical regions: abdomen, head and neck, and thorax. Validation and testing were performed through the challenge submission system. The results indicate that the method accurately reconstructs global anatomical structures; however, preservation of fine details was limited, primarily due to the relatively low training resolution imposed by memory and runtime constraints. Future work will explore patch-based training and latent-space flow models to improve resolution and local structural fidelity.

[370] Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation

Muquan Li, Hang Gou, Dongyang Zhang, Shuang Liang, Xiurui Xie, Deqiang Ouyang, Ke Qin

Main category: cs.CV

TL;DR: AT-BPTT is a novel dataset distillation framework that dynamically adapts truncation positions and window sizes based on gradient behavior, achieving state-of-the-art performance with significant speed and memory improvements.

DetailsMotivation: Existing inner-loop optimization methods for dataset distillation rely on random truncation strategies, which lack flexibility and yield suboptimal results due to neural networks' distinct learning dynamics across different training stages.

Method: Proposes Automatic Truncated Backpropagation Through Time (AT-BPTT) with three key components: probabilistic stage-aware timestep selection, adaptive window sizing based on gradient variation, and low-rank Hessian approximation for computational efficiency.

Result: AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods, accelerates inner-loop optimization by 3.9x, and saves 63% memory cost on datasets including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K.

Conclusion: AT-BPTT effectively addresses the limitations of random truncation in dataset distillation by dynamically adapting to neural network learning dynamics, achieving superior performance and efficiency.

Abstract: The growing demand for efficient deep learning has positioned dataset distillation as a pivotal technique for compressing training dataset while preserving model performance. However, existing inner-loop optimization methods for dataset distillation typically rely on random truncation strategies, which lack flexibility and often yield suboptimal results. In this work, we observe that neural networks exhibit distinct learning dynamics across different training stages-early, middle, and late-making random truncation ineffective. To address this limitation, we propose Automatic Truncated Backpropagation Through Time (AT-BPTT), a novel framework that dynamically adapts both truncation positions and window sizes according to intrinsic gradient behavior. AT-BPTT introduces three key components: (1) a probabilistic mechanism for stage-aware timestep selection, (2) an adaptive window sizing strategy based on gradient variation, and (3) a low-rank Hessian approximation to reduce computational overhead. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods. Moreover, our approach accelerates inner-loop optimization by 3.9x while saving 63% memory cost.

[371] Detailed Aerial Mapping of Photovoltaic Power Plants Through Semantically Significant Keypoints

Viktor Kozák, Jan Chudoba, Libor Přeučil

Main category: cs.CV

TL;DR: A novel method for automated photovoltaic power plant mapping using aerial images to create detailed 3D models down to individual PV modules without relying on third-party data.

DetailsMotivation: Accurate and up-to-date PV power plant models are essential for optimal operation and maintenance, but such models are often not easily available, creating a need for automated mapping solutions.

Method: Uses visual segmentation of PV modules in aerial overview images, infers structural information to assign modules to benches/rows/columns, identifies visual keypoints for layout, and merges detections from multiple images while maintaining structural integrity.

Result: Successfully tested on two different power plants, resulting in compact georeferenced 3D models with semantic structures suitable for maintenance applications.

Conclusion: The approach enables automated PV power plant mapping with detailed structural modeling down to individual module level, eliminating reliance on third-party data sources.

Abstract: An accurate and up-to-date model of a photovoltaic (PV) power plant is essential for its optimal operation and maintenance. However, such a model may not be easily available. This work introduces a novel approach for PV power plant mapping based on aerial overview images. It enables the automation of the mapping process while removing the reliance on third-party data. The presented mapping method takes advantage of the structural layout of the power plants to achieve detailed modeling down to the level of individual PV modules. The approach relies on visual segmentation of PV modules in overview images and the inference of structural information in each image, assigning modules to individual benches, rows, and columns. We identify visual keypoints related to the layout and use these to merge detections from multiple images while maintaining their structural integrity. The presented method was experimentally verified and evaluated on two different power plants. The final fusion of 3D positions and semantic structures results in a compact georeferenced model suitable for power plant maintenance.

[372] From Actions to Kinesics: Extracting Human Psychological States through Bodily Movements

Cheyu Lin, Katherine A. Flanigan

Main category: cs.CV

TL;DR: A kinesics recognition framework using ST-GCN and CNN to infer psychological states from 3D skeleton data, enabling privacy-preserving and scalable human behavior modeling.

DetailsMotivation: To overcome limitations of traditional methods (theoretical models, questionnaires) in capturing human psychological states for human-environment interaction modeling, which are limited in scope, static, and labor-intensive.

Method: Combines spatial-temporal graph convolutional network (ST-GCN) with convolutional neural network (CNN) using transfer learning on 3D skeleton joint data to infer kinesics (communicative functions of human activity) without manual mappings.

Result: Demonstrated on Dyadic User Engagement (DUET) dataset, the method enables scalable, accurate, and human-centered modeling of behavior while preserving user anonymity.

Conclusion: Provides a new pathway for enhancing RL-driven simulations of human-environment interaction by uncovering latent structures in bodily movements that reflect cognitive and emotional states.

Abstract: Understanding the dynamic relationship between humans and the built environment is a key challenge in disciplines ranging from environmental psychology to reinforcement learning (RL). A central obstacle in modeling these interactions is the inability to capture human psychological states in a way that is both generalizable and privacy preserving. Traditional methods rely on theoretical models or questionnaires, which are limited in scope, static, and labor intensive. We present a kinesics recognition framework that infers the communicative functions of human activity – known as kinesics – directly from 3D skeleton joint data. Combining a spatial-temporal graph convolutional network (ST-GCN) with a convolutional neural network (CNN), the framework leverages transfer learning to bypass the need for manually defined mappings between physical actions and psychological categories. The approach preserves user anonymity while uncovering latent structures in bodily movements that reflect cognitive and emotional states. Our results on the Dyadic User EngagemenT (DUET) dataset demonstrate that this method enables scalable, accurate, and human-centered modeling of behavior, offering a new pathway for enhancing RL-driven simulations of human-environment interaction.

[373] Read the Room: Inferring Social Context Through Dyadic Interaction Recognition in Cyber-physical-social Infrastructure Systems

Cheyu Lin, John Martins, Katherine A. Flanigan, Ph. D

Main category: cs.CV

TL;DR: This paper compares skeleton-based algorithms for recognizing dyadic human interactions using depth sensors as a privacy-preserving alternative to RGB cameras, analyzing 12 interaction types from a dataset focused on cultural and emotional communication.

DetailsMotivation: Cyber-physical systems traditionally focus on economic goals but neglect social benefits. Cyber-physical-social infrastructure systems aim to address this gap by aligning with social objectives, requiring better understanding of human interactions while preserving privacy.

Method: The study compares five skeleton-based interaction recognition algorithms on a dataset of 12 dyadic interactions. Depth sensors are used instead of RGB cameras to address privacy concerns, analyzing skeletal movements to recognize interactions categorized into communication types like emblems and affect displays.

Result: The research provides a comparative analysis of five skeleton-based algorithms for dyadic interaction recognition, using a dataset specifically designed for understanding cultural and emotional aspects of human interactions through privacy-preserving depth sensing.

Conclusion: Skeleton-based interaction recognition using depth sensors offers a viable privacy-preserving approach for understanding human interactions, laying the foundation for cyber-physical-social infrastructure systems that can better address social objectives while respecting privacy concerns.

Abstract: Cyber-physical systems (CPS) integrate sensing, computing, and control to improve infrastructure performance, focusing on economic goals like performance and safety. However, they often neglect potential human-centered (or ‘‘social’’) benefits. Cyber-physical-social infrastructure systems (CPSIS) aim to address this by aligning CPS with social objectives. This involves defining social benefits, understanding human interactions with each other and infrastructure, developing privacy-preserving measurement methods, modeling these interactions for prediction, linking them to social benefits, and actuating the physical environment to foster positive social outcomes. This paper delves into recognizing dyadic human interactions using real-world data, which is the backbone to measuring social behavior. This lays a foundation to address the need to enhance understanding of the deeper meanings and mutual responses inherent in human interactions. While RGB cameras are informative for interaction recognition, privacy concerns arise. Depth sensors offer a privacy-conscious alternative by analyzing skeletal movements. This study compares five skeleton-based interaction recognition algorithms on a dataset of 12 dyadic interactions. Unlike single-person datasets, these interactions, categorized into communication types like emblems and affect displays, offer insights into the cultural and emotional aspects of human interactions.

[374] ERDE: Entropy-Regularized Distillation for Early-exit

Martial Guidez, Stefan Duffner, Yannick Alpou, Oscar Röth, Christophe Garcia

Main category: cs.CV

TL;DR: A method combining early exits and knowledge distillation to reduce computational costs in neural networks while maintaining accuracy, using a new entropy-based loss for incorrectly classified images.

DetailsMotivation: Deep neural networks have high computational costs that make them impractical for real-time and edge applications, requiring compression techniques that maintain accuracy while reducing complexity.

Method: Integrates early exits and knowledge distillation, training a reduced student early-exit model from a complex teacher early-exit model with a new entropy-based loss for images where teacher classification was incorrect.

Result: Achieves significant reductions in computational complexity without compromising classification performance on CIFAR10, CIFAR100 and SVHN datasets.

Conclusion: The approach effectively optimizes the accuracy-efficiency trade-off and opens new research perspectives for Knowledge Distillation in other contexts.

Abstract: Although deep neural networks and in particular Convolutional Neural Networks have demonstrated state-of-the-art performance in image classification with relatively high efficiency, they still exhibit high computational costs, often rendering them impractical for real-time and edge applications. Therefore, a multitude of compression techniques have been developed to reduce these costs while maintaining accuracy. In addition, dynamic architectures have been introduced to modulate the level of compression at execution time, which is a desirable property in many resource-limited application scenarios. The proposed method effectively integrates two well-established optimization techniques: early exits and knowledge distillation, where a reduced student early-exit model is trained from a more complex teacher early-exit model. The primary contribution of this research lies in the approach for training the student early-exit model. In comparison to the conventional Knowledge Distillation loss, our approach incorporates a new entropy-based loss for images where the teacher’s classification was incorrect. The proposed method optimizes the trade-off between accuracy and efficiency, thereby achieving significant reductions in computational complexity without compromising classification performance. The validity of this approach is substantiated by experimental results on image classification datasets CIFAR10, CIFAR100 and SVHN, which further opens new research perspectives for Knowledge Distillation in other contexts.

[375] μDeepIQA: deep learning-based fast and robust image quality assessment with local predictions for optical microscopy

Elena Corbetta, Thomas Bocklitz

Main category: cs.CV

TL;DR: μDeepIQA is a deep learning-based image quality assessment method for optical microscopy that provides fast, stable quality predictions and patch-wise quality visualization, overcoming limitations of traditional metrics.

DetailsMotivation: Traditional image quality assessment methods for optical microscopy are computationally expensive for large datasets and unstable for images outside ideal domains, requiring more robust and efficient solutions.

Method: The approach retrains a deep convolutional neural network architecture designed for natural image IQA on optical microscopy data to predict both individual quality metrics and global quality scores.

Result: μDeepIQA provides fast and stable quality predictions that generalize well even outside standard method ranges, with the ability to visualize spatially varying quality through patch-wise predictions.

Conclusion: Deep learning models like μDeepIQA benefit optical microscopy studies through stable outlier performance, small patch assessment capability, and rapid predictions, enhancing generalizability over traditional methods.

Abstract: Optical microscopy is one of the most widely used techniques in research studies for life sciences and biomedicine. These applications require reliable experimental pipelines to extract valuable knowledge from the measured samples and must be supported by image quality assessment (IQA) to ensure correct processing and analysis of the image data. IQA methods are implemented with variable complexity. However, while most quality metrics have a straightforward implementation, they might be time consuming and computationally expensive when evaluating a large dataset. In addition, quality metrics are often designed for well-defined image features and may be unstable for images out of the ideal domain. To overcome these limitations, recent works have proposed deep learning-based IQA methods, which can provide superior performance, increased generalizability and fast prediction. Our method, named $\mathrm{\mu}$DeepIQA, is inspired by previous studies and applies a deep convolutional neural network designed for IQA on natural images to optical microscopy measurements. We retrained the same architecture to predict individual quality metrics and global quality scores for optical microscopy data. The resulting models provide fast and stable predictions of image quality by generalizing quality estimation even outside the ideal range of standard methods. In addition, $\mathrm{\mu}$DeepIQA provides patch-wise prediction of image quality and can be used to visualize spatially varying quality in a single image. Our study demonstrates that optical microscopy-based studies can benefit from the generalizability of deep learning models due to their stable performance in the presence of outliers, the ability to assess small image patches, and rapid predictions.

[376] In-Field Mapping of Grape Yield and Quality with Illumination-Invariant Deep Learning

Ciem Cornelissen, Sander De Coninck, Axel Willekens, Sam Leroux, Pieter Simoens

Main category: cs.CV

TL;DR: An IoT-enabled robotic system for real-time mapping of grape yield and quality in vineyards using hyperspectral imaging and deep learning, with a novel domain-adversarial framework to handle illumination variations.

DetailsMotivation: To enable non-destructive, real-time, and spatially-resolved mapping of grape yield and quality in vineyards for precision viticulture, overcoming the challenge of domain shift caused by variable illumination in field conditions.

Method: End-to-end system with two modules: high-performance model for grape bunch detection and weight estimation, and Light-Invariant Spectral Autoencoder (LISA) - a domain-adversarial framework that learns illumination-invariant features from uncalibrated hyperspectral data across different lighting conditions.

Result: Complete pipeline achieves 0.82 recall for bunch detection and R² of 0.76 for weight prediction. LISA module improves quality prediction generalization by over 20% compared to baselines, validated across three illumination domains (lab, morning sunlight, afternoon sunlight).

Conclusion: The system successfully generates high-resolution, georeferenced data of both grape yield and quality, providing actionable insights for precision viticulture by overcoming illumination challenges through robust domain-invariant learning.

Abstract: This paper presents an end-to-end, IoT-enabled robotic system for the non-destructive, real-time, and spatially-resolved mapping of grape yield and quality (Brix, Acidity) in vineyards. The system features a comprehensive analytical pipeline that integrates two key modules: a high-performance model for grape bunch detection and weight estimation, and a novel deep learning framework for quality assessment from hyperspectral (HSI) data. A critical barrier to in-field HSI is the ``domain shift" caused by variable illumination. To overcome this, our quality assessment is powered by the Light-Invariant Spectral Autoencoder (LISA), a domain-adversarial framework that learns illumination-invariant features from uncalibrated data. We validated the system’s robustness on a purpose-built HSI dataset spanning three distinct illumination domains: controlled artificial lighting (lab), and variable natural sunlight captured in the morning and afternoon. Results show the complete pipeline achieves a recall (0.82) for bunch detection and a $R^2$ (0.76) for weight prediction, while the LISA module improves quality prediction generalization by over 20% compared to the baselines. By combining these robust modules, the system successfully generates high-resolution, georeferenced data of both grape yield and quality, providing actionable, data-driven insights for precision viticulture.

[377] BenthiCat: An opti-acoustic dataset for advancing benthic classification and habitat mapping

Hayat Rajani, Valerio Franchi, Borja Martinez-Clavel Valles, Raimon Ramos, Rafael Garcia, Nuno Gracias

Main category: cs.CV

TL;DR: A comprehensive multi-modal dataset for benthic habitat mapping with ~1M side-scan sonar tiles, bathymetric maps, and optical images, including 36K annotated tiles and tools for cross-modal learning.

DetailsMotivation: Addressing the scarcity of large annotated datasets for marine habitat mapping to enable development and benchmarking of machine learning models in this domain.

Method: Collection of multi-modal data including side-scan sonar tiles, bathymetric maps, and optical images from AUV surveys, with spatial association between optical images and SSS tiles for cross-modal representation learning.

Result: Created a standardized dataset with ~1M SSS tiles, bathymetric maps, optical images, and 36K annotated segmentation masks, plus open-source preprocessing and annotation tools.

Conclusion: This resource establishes a benchmark for underwater habitat mapping and promotes advancements in autonomous seafloor classification and multi-sensor integration.

Abstract: Benthic habitat mapping is fundamental for understanding marine ecosystems, guiding conservation efforts, and supporting sustainable resource management. Yet, the scarcity of large, annotated datasets limits the development and benchmarking of machine learning models in this domain. This paper introduces a thorough multi-modal dataset, comprising about a million side-scan sonar (SSS) tiles collected along the coast of Catalonia (Spain), complemented by bathymetric maps and a set of co-registered optical images from targeted surveys using an autonomous underwater vehicle (AUV). Approximately \num{36000} of the SSS tiles have been manually annotated with segmentation masks to enable supervised fine-tuning of classification models. All the raw sensor data, together with mosaics, are also released to support further exploration and algorithm development. To address challenges in multi-sensor data fusion for AUVs, we spatially associate optical images with corresponding SSS tiles, facilitating self-supervised, cross-modal representation learning. Accompanying open-source preprocessing and annotation tools are provided to enhance accessibility and encourage research. This resource aims to establish a standardized benchmark for underwater habitat mapping, promoting advancements in autonomous seafloor classification and multi-sensor integration.

[378] Comparative Analysis of YOLOv5, Faster R-CNN, SSD, and RetinaNet for Motorbike Detection in Kigali Autonomous Driving Context

Ngeyen Yinkfu, Sunday Nwovu, Jonathan Kayizzi, Angelique Uwamahoro

Main category: cs.CV

TL;DR: Comparison of four object detection models (YOLOv5, Faster R-CNN, SSD, RetinaNet) for motorbike detection in Kigali, Rwanda, using a custom dataset to evaluate their suitability for autonomous driving systems in resource-constrained environments.

DetailsMotivation: Motorcycle taxis in Kigali navigate unpredictably and violate traffic rules, creating significant challenges for autonomous driving systems that need reliable detection capabilities.

Method: Used four object detection models implemented in PyTorch with transfer learning on a custom dataset of 198 images collected in Kigali. Evaluated models based on accuracy, localization, and inference speed.

Result: The study identified implementation challenges including dataset limitations and model complexities. Performance comparisons were made across the four models for real-time navigation suitability.

Conclusion: Recommended simplified architectures for future work to enhance accessibility of autonomous systems in developing countries like Rwanda, addressing resource constraints and implementation challenges.

Abstract: In Kigali, Rwanda, motorcycle taxis are a primary mode of transportation, often navigating unpredictably and disregarding traffic rules, posing significant challenges for autonomous driving systems. This study compares four object detection models–YOLOv5, Faster R-CNN, SSD, and RetinaNet–for motorbike detection using a custom dataset of 198 images collected in Kigali. Implemented in PyTorch with transfer learning, the models were evaluated for accuracy, localization, and inference speed to assess their suitability for real-time navigation in resource-constrained settings. We identify implementation challenges, including dataset limitations and model complexities, and recommend simplified architectures for future work to enhance accessibility for autonomous systems in developing countries like Rwanda.

[379] A Semantics-Aware Hierarchical Self-Supervised Approach to Classification of Remote Sensing Images

Giulio Weikmann, Gianmarco Perantoni, Lorenzo Bruzzone

Main category: cs.CV

TL;DR: Proposes SAHC method for hierarchical remote sensing image classification using trainable hierarchy matrices and consensus mechanism to leverage semantic relationships between classes.

DetailsMotivation: Existing deep learning approaches overlook predefined label hierarchies in remote sensing classification, focusing only on fine-grained schemes without exploiting semantic relationships.

Method: Integrates hierarchy-specific classification heads with trainable hierarchy matrices for self-supervised hierarchical learning, plus hierarchical consensus mechanism for probability distribution consistency across levels.

Result: Effective in guiding network learning and robust for hierarchical classification tasks across three benchmark datasets with different hierarchical complexity and backbone architectures.

Conclusion: SAHC method successfully leverages hierarchical structure in remote sensing image classification, demonstrating effectiveness and adaptability across various datasets and architectures.

Abstract: Deep learning has become increasingly important in remote sensing image classification due to its ability to extract semantic information from complex data. Classification tasks often include predefined label hierarchies that represent the semantic relationships among classes. However, these hierarchies are frequently overlooked, and most approaches focus only on fine-grained classification schemes. In this paper, we present a novel Semantics-Aware Hierarchical Consensus (SAHC) method for learning hierarchical features and relationships by integrating hierarchy-specific classification heads within a deep network architecture, each specialized in different degrees of class granularity. The proposed approach employs trainable hierarchy matrices, which guide the network through the learning of the hierarchical structure in a self-supervised manner. Furthermore, we introduce a hierarchical consensus mechanism to ensure consistent probability distributions across different hierarchical levels. This mechanism acts as a weighted ensemble being able to effectively leverage the inherent structure of the hierarchical classification task. The proposed SAHC method is evaluated on three benchmark datasets with different degrees of hierarchical complexity on different tasks, using distinct backbone architectures to effectively emphasize its adaptability. Experimental results show both the effectiveness of the proposed approach in guiding network learning and the robustness of the hierarchical consensus for remote sensing image classification tasks.

[380] REN: Anatomically-Informed Mixture-of-Experts for Interstitial Lung Disease Diagnosis

Alec K. Peltekian, Halil Ertugrul Aktas, Gorkem Durak, Kevin Grudzinski, Bradford C. Bemiss, Carrie Richardson, Jane E. Dematte, G. R. Scott Budinger, Anthony J. Esposito, Alexander Misharin, Alok Choudhary, Ankit Agrawal, Ulas Bagci

Main category: cs.CV

TL;DR: REN is an anatomically-informed Mixture-of-Experts framework for medical image classification that uses anatomical priors to train specialized experts for different lung regions, achieving superior performance in interstitial lung disease classification.

DetailsMotivation: Traditional MoE systems lack domain-specific constraints needed for medical imaging, where anatomical structure and regional disease heterogeneity strongly influence pathological patterns.

Method: Leverages anatomical priors to train seven specialized experts for distinct lung lobes and bilateral lung combinations. Uses multi-modal gating mechanisms that integrate radiomics biomarkers and deep learning features (CNN, ViT, Mamba) to optimally weight expert contributions.

Result: Achieved average AUC of 0.8646 ± 0.0467, a 12.5% improvement over SwinUNETR baseline (AUC 0.7685, p=0.031). Region-specific experts showed lower-lobe models achieving AUCs of 0.88-0.90, surpassing DL counterparts (CNN: 0.76-0.79).

Conclusion: REN demonstrates strong generalizability and clinical interpretability, presenting a scalable, anatomically-guided approach extensible to other structured medical imaging applications.

Abstract: Mixture-of-Experts (MoE) architectures have significantly contributed to scalable machine learning by enabling specialized subnetworks to tackle complex tasks efficiently. However, traditional MoE systems lack domain-specific constraints essential for medical imaging, where anatomical structure and regional disease heterogeneity strongly influence pathological patterns. Here, we introduce Regional Expert Networks (REN), the first anatomically-informed MoE framework tailored specifically for medical image classification. REN leverages anatomical priors to train seven specialized experts, each dedicated to distinct lung lobes and bilateral lung combinations, enabling precise modeling of region-specific pathological variations. Multi-modal gating mechanisms dynamically integrate radiomics biomarkers and deep learning (DL) features (CNN, ViT, Mamba) to weight expert contributions optimally. Applied to interstitial lung disease (ILD) classification, REN achieves consistently superior performance: the radiomics-guided ensemble reached an average AUC of 0.8646 +/- 0.0467, a +12.5 percent improvement over the SwinUNETR baseline (AUC 0.7685, p = 0.031). Region-specific experts further revealed that lower-lobe models achieved AUCs of 0.88-0.90, surpassing DL counterparts (CNN: 0.76-0.79) and aligning with known disease progression patterns. Through rigorous patient-level cross-validation, REN demonstrates strong generalizability and clinical interpretability, presenting a scalable, anatomically-guided approach readily extensible to other structured medical imaging applications.

[381] Unsupervised Active Learning via Natural Feature Progressive Framework

Yuxi Liu, Catherine Lalman, Yimin Yang

Main category: cs.CV

TL;DR: NFPF is a novel Unsupervised Active Learning method that uses Specific Feature Learning Machine to measure sample importance and Reconstruction Difference metric for selection, achieving performance comparable to supervised AL methods.

DetailsMotivation: Traditional Active Learning requires expensive human annotation in iterative steps, while existing Unsupervised AL methods struggle with performance due to local gradient-based scoring and shallow selection approaches.

Method: Proposes Natural Feature Progressive Framework (NFPF) with Specific Feature Learning Machine (SFLM) to quantify sample contribution to model performance and uses Reconstruction Difference metric for sample selection.

Result: NFPF significantly outperforms all established UAL methods and achieves performance on par with supervised AL methods on vision datasets, with superior robustness and better data distribution coverage.

Conclusion: NFPF revolutionizes sample importance measurement in UAL, providing a more effective framework that reduces annotation burden while maintaining high performance comparable to supervised approaches.

Abstract: The effectiveness of modern deep learning models is predicated on the availability of large-scale, human-annotated datasets, a process that is notoriously expensive and time-consuming. While Active Learning (AL) offers a strategic solution by labeling only the most informative and representative data, its iterative nature still necessitates significant human involvement. Unsupervised Active Learning (UAL) presents an alternative by shifting the annotation burden to a single, post-selection step. Unfortunately, prevailing UAL methods struggle to achieve state-of-the-art performance. These approaches typically rely on local, gradient-based scoring for sample importance estimation, which not only makes them vulnerable to ambiguous and noisy data but also hinders their capacity to select samples that adequately represent the full data distribution. Moreover, their use of shallow, one-shot linear selection falls short of a true UAL paradigm. In this paper, we propose the Natural Feature Progressive Framework (NFPF), a UAL method that revolutionizes how sample importance is measured. At its core, NFPF employs a Specific Feature Learning Machine (SFLM) to effectively quantify each sample’s contribution to model performance. We further utilize the SFLM to define a powerful Reconstruction Difference metric for initial sample selection. Our comprehensive experiments show that NFPF significantly outperforms all established UAL methods and achieves performance on par with supervised AL methods on vision datasets. Detailed ablation studies and qualitative visualizations provide compelling evidence for NFPF’s superior performance, enhanced robustness, and improved data distribution coverage.

[382] Bidirectional Mammogram View Translation with Column-Aware and Implicit 3D Conditional Diffusion

Xin Li, Kaixiang Yang, Qiang Li, Zhiwei Wang

Main category: cs.CV

TL;DR: CA3D-Diff is a novel bidirectional mammogram view translation framework that uses conditional diffusion models with column-aware cross-attention and implicit 3D structure reconstruction to address view misalignment and improve breast cancer diagnosis when one mammography view is missing.

DetailsMotivation: In real-world clinical workflows, one mammography view (CC or MLO) may be missing, corrupted, or degraded due to acquisition errors or compression artifacts, limiting diagnostic effectiveness. View-to-view translation can help recover missing views and improve lesion alignment, but this is challenging due to large non-rigid deformations and severe tissue overlap in X-ray projections.

Method: Proposes CA3D-Diff framework with two key components: 1) Column-aware cross-attention mechanism that leverages geometric properties where anatomically corresponding regions lie in similar column positions across views, using Gaussian-decayed bias to emphasize local correlations; 2) Implicit 3D structure reconstruction module that back-projects noisy 2D latents into coarse 3D feature volume based on breast-view projection geometry, then refines and injects it into the denoising UNet.

Result: Extensive experiments show CA3D-Diff achieves superior performance in bidirectional view translation tasks, outperforming state-of-the-art methods in visual fidelity and structural consistency. The synthesized views effectively improve single-view malignancy classification in screening settings.

Conclusion: CA3D-Diff demonstrates practical value in real-world diagnostics by enabling effective view translation when one mammography view is unavailable, improving both image quality and downstream diagnostic tasks like malignancy classification.

Abstract: Dual-view mammography, including craniocaudal (CC) and mediolateral oblique (MLO) projections, offers complementary anatomical views crucial for breast cancer diagnosis. However, in real-world clinical workflows, one view may be missing, corrupted, or degraded due to acquisition errors or compression artifacts, limiting the effectiveness of downstream analysis. View-to-view translation can help recover missing views and improve lesion alignment. Unlike natural images, this task in mammography is highly challenging due to large non-rigid deformations and severe tissue overlap in X-ray projections, which obscure pixel-level correspondences. In this paper, we propose Column-Aware and Implicit 3D Diffusion (CA3D-Diff), a novel bidirectional mammogram view translation framework based on conditional diffusion model. To address cross-view structural misalignment, we first design a column-aware cross-attention mechanism that leverages the geometric property that anatomically corresponding regions tend to lie in similar column positions across views. A Gaussian-decayed bias is applied to emphasize local column-wise correlations while suppressing distant mismatches. Furthermore, we introduce an implicit 3D structure reconstruction module that back-projects noisy 2D latents into a coarse 3D feature volume based on breast-view projection geometry. The reconstructed 3D structure is refined and injected into the denoising UNet to guide cross-view generation with enhanced anatomical awareness. Extensive experiments demonstrate that CA3D-Diff achieves superior performance in bidirectional tasks, outperforming state-of-the-art methods in visual fidelity and structural consistency. Furthermore, the synthesized views effectively improve single-view malignancy classification in screening settings, demonstrating the practical value of our method in real-world diagnostics.

[383] SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

Théophane Vallaeys, Jakob Verbeek, Matthieu Cord

Main category: cs.CV

TL;DR: SSDD is a new diffusion decoder architecture that uses distillation to achieve single-step reconstruction without adversarial losses, outperforming KL-VAE in both reconstruction quality and speed.

DetailsMotivation: Current tokenizers based on KL-VAE have limitations including the need for adversarial losses and slow iterative sampling. Diffusion decoders offer a principled alternative but still face similar issues.

Method: Introduces a pixel diffusion decoder with transformer components and GAN-free training, then uses distillation to create an efficient single-step decoder (SSDD).

Result: SSDD improves reconstruction FID from 0.87 to 0.50 with 1.4× higher throughput, and preserves DiT generation quality with 3.8× faster sampling.

Conclusion: SSDD can serve as a drop-in replacement for KL-VAE, enabling higher-quality and faster generative models without adversarial training.

Abstract: Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, as well as a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserve generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.

[384] ActiveMark: on watermarking of visual foundation models via massive activations

Anna Chistyakova, Mikhail Pautov

Main category: cs.CV

TL;DR: Proposes a watermarking method for visual foundation models (VFMs) to protect intellectual property by embedding detectable watermarks in internal representations, maintaining detectability even after fine-tuning.

DetailsMotivation: Protect VFM intellectual property rights against illegal redistribution by dishonest users, as current models lack reliable ownership verification tools.

Method: Fine-tune expressive layers of VFM with encoder-decoder network to embed digital watermarks into internal representations of hold-out input images.

Result: Watermarks remain detectable in functional copies after fine-tuning, with low false detection and misdetection probabilities demonstrated theoretically and experimentally.

Conclusion: The proposed method provides effective ownership verification for VFMs, protecting against unauthorized redistribution while maintaining model functionality.

Abstract: Being trained on large and vast datasets, visual foundation models (VFMs) can be fine-tuned for diverse downstream tasks, achieving remarkable performance and efficiency in various computer vision applications. The high computation cost of data collection and training motivates the owners of some VFMs to distribute them alongside the license to protect their intellectual property rights. However, a dishonest user of the protected model’s copy may illegally redistribute it, for example, to make a profit. As a consequence, the development of reliable ownership verification tools is of great importance today, since such methods can be used to differentiate between a redistributed copy of the protected model and an independent model. In this paper, we propose an approach to ownership verification of visual foundation models by fine-tuning a small set of expressive layers of a VFM along with a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. Importantly, the watermarks embedded remain detectable in the functional copies of the protected model, obtained, for example, by fine-tuning the VFM for a particular downstream task. Theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection of a non-watermarked model and a low probability of false misdetection of a watermarked model.

[385] Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition

Koen Vellenga, H. Joe Steinhauer, Jonas Andersson, Anders Sjögren

Main category: cs.CV

TL;DR: Proposes latent uncertainty representation (LUR) and repulsively trained LUR (RLUR) methods for uncertainty estimation in deep neural networks, achieving comparable performance to existing probabilistic deep learning methods while being more efficient for out-of-distribution detection.

DetailsMotivation: Deep neural networks are used in safety-critical tasks but struggle with uncertainty estimation and out-of-distribution detection. Last layer probabilistic deep learning methods exist but have varying performance and computational requirements.

Method: Extends pre-trained DNNs with transformation layers to produce multiple latent representations for uncertainty estimation. Compares LUR and RLUR against eight probabilistic deep learning methods across four driver action and intention recognition datasets.

Result: LUR and RLUR achieve comparable in-distribution classification performance to other approaches. For uncertainty-based OOD detection, LUR matches top-performing methods while being more efficient to train and easier to tune than approaches requiring Markov-Chain Monte Carlo sampling or repulsive training.

Conclusion: The proposed LUR and RLUR methods provide effective uncertainty estimation for safety-critical applications with better efficiency and ease of tuning compared to existing probabilistic deep learning approaches.

Abstract: Deep neural networks (DNNs) are increasingly applied to safety-critical tasks in resource-constrained environments, such as video-based driver action and intention recognition. While last layer probabilistic deep learning (LL-PDL) methods can detect out-of-distribution (OOD) instances, their performance varies. As an alternative to last layer approaches, we propose extending pre-trained DNNs with transformation layers to produce multiple latent representations to estimate the uncertainty. We evaluate our latent uncertainty representation (LUR) and repulsively trained LUR (RLUR) approaches against eight PDL methods across four video-based driver action and intention recognition datasets, comparing classification performance, calibration, and uncertainty-based OOD detection. We also contribute 28,000 frame-level action labels and 1,194 video-level intention labels for the NuScenes dataset. Our results show that LUR and RLUR achieve comparable in-distribution classification performance to other LL-PDL approaches. For uncertainty-based OOD detection, LUR matches top-performing PDL methods while being more efficient to train and easier to tune than approaches that require Markov-Chain Monte Carlo sampling or repulsive training procedures.

[386] Exploring the Efficacy of Modified Transfer Learning in Identifying Parkinson’s Disease Through Drawn Image Patterns

Nabil Daiyan, Md Rakibul Haque

Main category: cs.CV

TL;DR: Machine learning approach using hand-drawn spiral and wave images achieves 93.3% accuracy for Parkinson’s disease detection through CNN, transfer learning, and ensemble voting methods.

DetailsMotivation: Early PD diagnosis is crucial but traditional methods are cumbersome and costly. Hand-drawn images offer a non-invasive, cost-effective biomarker alternative.

Method: Three-phase architecture: pre-trained CNNs, custom convolutional layers with attention mechanisms, and ensemble hard voting. Dataset augmentation for spiral and wave images.

Result: Spiral images: 90% precision, recall, F1-score. Wave images: 96.67% metrics. Combined ensemble voting: 93.3% overall accuracy.

Conclusion: Machine learning with hand-drawn images shows strong potential for early PD diagnosis, providing a non-invasive and cost-effective solution.

Abstract: Parkinson’s disease (PD) is a progressive neurodegenerative condition characterized by the death of dopaminergic neurons, leading to various movement disorder symptoms. Early diagnosis of PD is crucial to prevent adverse effects, yet traditional diagnostic methods are often cumbersome and costly. In this study, a machine learning-based approach is proposed using hand-drawn spiral and wave images as potential biomarkers for PD detection. Our methodology leverages convolutional neural networks (CNNs), transfer learning, and attention mechanisms to improve model performance and resilience against overfitting. To enhance the diversity and richness of both spiral and wave categories, the training dataset undergoes augmentation to increase the number of images. The proposed architecture comprises three phases: utilizing pre-trained CNNs, incorporating custom convolutional layers, and ensemble voting. Employing hard voting further enhances performance by aggregating predictions from multiple models. Experimental results show promising accuracy rates. For spiral images, weighted average precision, recall, and F1-score are 90%, and for wave images, they are 96.67%. After combining the predictions through ensemble hard voting, the overall accuracy is 93.3%. These findings underscore the potential of machine learning in early PD diagnosis, offering a non-invasive and cost-effective solution to improve patient outcomes.

[387] Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

Main category: cs.CV

TL;DR: This survey provides the first comprehensive examination of post-training methodologies for Video-Large Multimodal Models (Video-LMMs), covering supervised fine-tuning, reinforcement learning, and test-time scaling techniques.

DetailsMotivation: Video understanding is challenging but Video-LMMs show promise. However, the post-training phase that transforms these models into sophisticated reasoning engines remains fragmented in literature.

Method: Systematic analysis of three fundamental post-training pillars: supervised fine-tuning with chain-of-thought, reinforcement learning from verifiable objectives, and test-time scaling through enhanced inference computation.

Result: Presents a structured taxonomy clarifying roles, interconnections, and video-specific adaptations of these techniques, addressing challenges like temporal localization and spatiotemporal grounding.

Conclusion: Provides researchers with a unified framework for advancing Video-LMM capabilities, including curated benchmarks and metrics for rigorous assessment of post-training effectiveness.

Abstract: Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

[388] SegMASt3R: Geometry Grounded Segment Matching

Rohit Jayanti, Swayam Agrawal, Vansh Garg, Siddharth Tourani, Muhammad Haris Khan, Sourav Garg, Madhava Krishna

Main category: cs.CV

TL;DR: Leveraging 3D foundation models for wide-baseline segment matching across extreme viewpoint changes, achieving 30% improvement over state-of-the-art methods.

DetailsMotivation: Segment matching captures structured regions more robustly than keypoint matching for handling occlusions, lighting variations, and viewpoint changes, especially in challenging wide-baseline scenarios.

Method: Proposed architecture using inductive bias from 3D foundation models to match segments across image pairs with up to 180-degree viewpoint changes.

Result: Outperforms state-of-the-art methods (SAM2 video propagator and local feature matching) by up to 30% on AUPRC metric on ScanNet++ and Replica datasets, with demonstrated benefits for 3D instance segmentation and image-goal navigation.

Conclusion: The approach successfully leverages 3D foundation models to achieve superior segment matching performance in extreme viewpoint scenarios, enabling better downstream applications.

Abstract: Segment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. Unlike keypoint matching, which focuses on localized features, segment matching captures structured regions, offering greater robustness to occlusions, lighting variations, and viewpoint changes. In this paper, we leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching, a challenging setting involving extreme viewpoint shifts. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to 180 degree view-point change. Extensive experiments show that our approach outperforms state-of-the-art methods, including the SAM2 video propagator and local feature matching methods, by upto 30% on the AUPRC metric, on ScanNet++ and Replica datasets. We further demonstrate benefits of the proposed model on relevant downstream tasks, including 3D instance segmentation and image-goal navigation. Project Page: https://segmast3r.github.io/

[389] No-reference Quality Assessment of Contrast-distorted Images using Contrast-enhanced Pseudo Reference

Mohammad-Ali Mahmoudpour, Saeed Mahmoudpour

Main category: cs.CV

TL;DR: A no-reference image quality assessment method for contrast-distorted images that transforms the problem into full-reference assessment by generating pseudo-reference images using contrast enhancement algorithms.

DetailsMotivation: Contrast change significantly affects image quality but has been largely overlooked in image quality assessment research compared to other distortions like blur and noise.

Method: Uses contrast enhancement algorithms to generate pseudo-reference images, trains a classification network to select the best algorithm based on image content and distortion, then performs full-reference assessment between pseudo-reference and degraded images.

Result: The method shows promising performance when evaluated on three databases containing contrast distortions (CCID2014, TID2013, and CSIQ).

Conclusion: The proposed approach effectively addresses contrast distortion assessment by transforming no-reference problems into more accurate full-reference evaluations through intelligent pseudo-reference generation.

Abstract: Contrast change is an important factor that affects the quality of images. During image capturing, unfavorable lighting conditions can cause contrast change and visual quality loss. While various methods have been proposed to assess the quality of images under different distortions such as blur and noise, contrast distortion has been largely overlooked as its visual impact and properties are different from other conventional types of distortions. In this paper, we propose a no-reference image quality assessment (NR-IQA) metric for contrast-distorted images. Using a set of contrast enhancement algorithms, we aim to generate pseudo-reference images that are visually close to the actual reference image, such that the NR problem is transformed to a Full-reference (FR) assessment with higher accuracy. To this end, a large dataset of contrast-enhanced images is produced to train a classification network that can select the most suitable contrast enhancement algorithm based on image content and distortion for pseudo-reference image generation. Finally, the evaluation is performed in the FR manner to assess the quality difference between the contrast-enhanced (pseudoreference) and degraded images. Performance evaluation of the proposed method on three databases containing contrast distortions (CCID2014, TID2013, and CSIQ), indicates the promising performance of the proposed method.

[390] Neuroplastic Modular Framework: Cross-Domain Image Classification of Garbage and Industrial Surfaces

Debojyoti Ghosh, Soumya K Ghosh, Adrijit Goswami

Main category: cs.CV

TL;DR: A novel Neuroplastic Modular Classifier combines ResNet-50 and Vision Transformer with FAISS-based memory retrieval for adaptive image classification in waste management and industrial defect detection, outperforming traditional static models.

DetailsMotivation: Efficient waste classification and industrial surface defect detection are essential for sustainable waste management and quality control, requiring robust and adaptive image classification systems for dynamic environments.

Method: Hybrid architecture combining ResNet-50 for localized features and Vision Transformer for global context, with FAISS-based similarity retrieval and neuroplastic modular design featuring expandable blocks that grow during training when performance plateaus.

Result: Outperforms traditional static models in both accuracy and adaptability, validated on waste classification and KolektorSDD2 industrial defect dataset.

Conclusion: The Neuroplastic Modular Classifier provides a scalable, high-performance solution for real-world image classification with strong applicability in environmental and industrial domains.

Abstract: Efficient and accurate classification of waste and industrial surface defects is essential for ensuring sustainable waste management and maintaining high standards in quality control. This paper introduces the Neuroplastic Modular Classifier, a novel hybrid architecture designed for robust and adaptive image classification in dynamic environments. The model combines a ResNet-50 backbone for localized feature extraction with a Vision Transformer (ViT) to capture global semantic context. Additionally, FAISS-based similarity retrieval is incorporated to provide a memory-like reference to previously encountered data, enriching the model’s feature space. A key innovation of our architecture is the neuroplastic modular design composed of expandable, learnable blocks that dynamically grow during training when performance plateaus. Inspired by biological learning systems, this mechanism allows the model to adapt to data complexity over time, improving generalization. Beyond garbage classification, we validate the model on the Kolektor Surface Defect Dataset 2 (KolektorSDD2), which involves industrial defect detection on metal surfaces. Experimental results across domains show that the proposed architecture outperforms traditional static models in both accuracy and adaptability. The Neuroplastic Modular Classifier offers a scalable, high-performance solution for real-world image classification, with strong applicability in both environmental and industrial domains.

[391] Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, Hongsheng Li

Main category: cs.CV

TL;DR: This paper addresses the challenge of generating and editing structured visuals like charts and diagrams by introducing a comprehensive framework including a large dataset, unified model training, and evaluation benchmark.

DetailsMotivation: Modern visual generation models struggle with structured visuals that require composition planning, text rendering, and multimodal reasoning for factual accuracy, creating a gap in AI capabilities.

Method: Constructed 1.3M structured image pairs from executable programs, trained unified VLM-FLUX.1 model with lightweight connector using three-stage curriculum, and developed StructBench benchmark with StructScore metric.

Result: Evaluation of 15 models shows even leading closed-source systems perform poorly, while their model achieves strong editing performance with consistent gains from inference-time reasoning across architectures.

Conclusion: The released dataset, model, and benchmark advance unified multimodal foundations for structured visual generation and editing, addressing a critical gap in current AI capabilities.

Abstract: While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.

[392] Character Mixing for Video Generation

Tingting Liao, Chongjian Ge, Guangyi Liu, Hao Li, Yi Zhou

Main category: cs.CV

TL;DR: A framework for generating videos where characters from different worlds interact naturally while preserving their identities and styles, using Cross-Character Embedding and Cross-Character Augmentation techniques.

DetailsMotivation: To enable natural interactions between characters from different contexts (like Mr. Bean and Tom & Jerry) without losing their original identities or causing style delusion where realistic characters appear cartoonish or vice versa.

Method: Uses Cross-Character Embedding (CCE) to learn identity and behavioral logic across multimodal sources, and Cross-Character Augmentation (CCA) to enrich training with synthetic co-existence and mixed-style data.

Result: Experiments on a curated benchmark with 10 characters from cartoons and live-action series show clear improvements in identity preservation, interaction quality, and robustness to style delusion.

Conclusion: The framework enables new forms of generative storytelling by allowing natural interactions between previously uncoexistent characters while maintaining stylistic fidelity.

Abstract: Imagine Mr. Bean stepping into Tom and Jerry–can we generate videos where characters interact naturally across different worlds? We study inter-character interaction in text-to-video generation, where the key challenge is to preserve each character’s identity and behaviors while enabling coherent cross-context interaction. This is difficult because characters may never have coexisted and because mixing styles often causes style delusion, where realistic characters appear cartoonish or vice versa. We introduce a framework that tackles these issues with Cross-Character Embedding (CCE), which learns identity and behavioral logic across multimodal sources, and Cross-Character Augmentation (CCA), which enriches training with synthetic co-existence and mixed-style data. Together, these techniques allow natural interactions between previously uncoexistent characters without losing stylistic fidelity. Experiments on a curated benchmark of cartoons and live-action series with 10 characters show clear improvements in identity preservation, interaction quality, and robustness to style delusion, enabling new forms of generative storytelling.Additional results and videos are available on our project page: https://tingtingliao.github.io/mimix/.

[393] VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu

Main category: cs.CV

TL;DR: VChain is a novel inference-time chain-of-visual-thought framework that uses large multimodal models to generate critical keyframes, which guide sparse tuning of video generators to improve complex dynamic synthesis.

DetailsMotivation: Current video generation models struggle with complex dynamics and coherent chains of consequences, while multimodal models have strong visual state reasoning capabilities that could enhance video generation.

Method: VChain leverages large multimodal models to generate sparse critical keyframes as snapshots, then uses these to guide sparse inference-time tuning of pre-trained video generators only at key moments.

Result: Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.

Conclusion: The approach is tuning-efficient, introduces minimal overhead, avoids dense supervision, and effectively bridges multimodal reasoning with video generation.

Abstract: Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.

[394] Foveated Retinotopy Improves Classification and Localization in CNNs

Jean-Nicolas Jérémie, Emmanuel Daucé, Laurent U Perrinet

Main category: cs.CV

TL;DR: Incorporating foveated retinotopic mapping into CNNs maintains classification accuracy while improving robustness to scale/rotation perturbations and enabling object localization through gaze position variations.

DetailsMotivation: Biological vision systems use foveated retinotopy for efficient visual processing, but this spatial organization remains largely unexplored in machine learning despite its potential benefits.

Method: Implemented foveated retinotopic transformation in the input layer of standard ResNet models and retrained them for image classification tasks.

Result: Maintained comparable classification accuracy while enhancing robustness to scale and rotational perturbations. Variations in classification probabilities across gaze positions served as effective indicators for object localization.

Conclusion: Foveated retinotopic mapping encodes implicit knowledge about visual object geometry and offers an efficient solution to visual search problems, similar to biological vision systems.

Abstract: From a falcon detecting prey to humans recognizing faces, many species exhibit extraordinary abilities in rapid visual localization and classification. These are made possible by a specialized retinal region called the fovea, which provides high acuity at the center of vision while maintaining lower resolution in the periphery. This distinctive spatial organization, preserved along the early visual pathway through retinotopic mapping, is fundamental to biological vision, yet remains largely unexplored in machine learning. Our study investigates how incorporating foveated retinotopy may benefit deep convolutional neural networks (CNNs) in image classification tasks. By implementing a foveated retinotopic transformation in the input layer of standard ResNet models and re-training them, we maintain comparable classification accuracy while enhancing the network’s robustness to scale and rotational perturbations. Although this architectural modification introduces increased sensitivity to fixation point shifts, we demonstrate how this apparent limitation becomes advantageous: variations in classification probabilities across different gaze positions serve as effective indicators for object localization. Our findings suggest that foveated retinotopic mapping encodes implicit knowledge about visual object geometry, offering an efficient solution to the visual search problem - a capability crucial for many living species.

[395] Interactive Test-Time Adaptation with Reliable Spatial-Temporal Voxels for Multi-Modal Segmentation

Haozhi Cao, Yuecong Xu, Pengyu Yin, Xingyu Ji, Shenghai Yuan, Jianfei Yang, Lihua Xie

Main category: cs.CV

TL;DR: Latte++ improves temporal consistency in multi-modal test-time adaptation for 3D segmentation using multi-window aggregation, while ITTA introduces human-in-the-loop feedback via interactive segmentation to handle consistently incorrect predictions.

DetailsMotivation: Previous MM-TTA methods for 3D segmentation suffer from unstable frame-wise predictions due to temporal inconsistency and consistently incorrect predictions that violate the assumption of reliable modality guidance.

Method: Two-fold framework: 1) Latte++ uses multi-window aggregation for better geometric correspondences and local prediction consistency evaluation, 2) ITTA employs interactive segmentation with point clicks and bounding boxes, using a lightweight promptable branch with momentum gradient module to capture human feedback.

Result: Extensive experiments across five MM-TTA benchmarks show consistent and notable improvements with robust performance gains for target classes in challenging imbalanced scenarios, with Latte++ providing complementary benefits for temporal stability.

Conclusion: The proposed framework effectively addresses both temporal inconsistency and consistently incorrect predictions in MM-TTA, achieving robust performance through geometric correspondences and human-in-the-loop feedback.

Abstract: Multi-modal test-time adaptation (MM-TTA) adapts models to an unlabeled target domain by leveraging the complementary multi-modal inputs in an online manner. While previous MM-TTA methods for 3D segmentation offer a promising solution by leveraging self-refinement per frame, they suffer from two major limitations: 1) unstable frame-wise predictions caused by temporal inconsistency, and 2) consistently incorrect predictions that violate the assumption of reliable modality guidance. To address these limitations, this work introduces a comprehensive two-fold framework. Firstly, building upon our previous work ReLiable Spatial-temporal Voxels (Latte), we propose Latte++ that better suppresses the unstable frame-wise predictions with more informative geometric correspondences. Instead of utilizing a universal sliding window, Latte++ employs multi-window aggregation to capture more reliable correspondences to better evaluate the local prediction consistency of different semantic categories. Secondly, to tackle the consistently incorrect predictions, we propose Interactive Test-Time Adaptation (ITTA), a flexible add-on to empower effortless human feedback with existing MM-TTA methods. ITTA introduces a novel human-in-the-loop approach that efficiently integrates minimal human feedback through interactive segmentation, requiring only simple point clicks and bounding box annotations. Instead of using independent interactive networks, ITTA employs a lightweight promptable branch with a momentum gradient module to capture and reuse knowledge from scarce human feedback during online inference. Extensive experiments across five MM-TTA benchmarks demonstrate that ITTA achieves consistent and notable improvements with robust performance gains for target classes of interest in challenging imbalanced scenarios, while Latte++ provides complementary benefits for temporal stability.

[396] Evaluating Perceptual Distance Models by Fitting Binomial Distributions to Two-Alternative Forced Choice Data

Alexander Hepburn, Raul Santos-Rodriguez, Javier Portilla

Main category: cs.CV

TL;DR: A new probabilistic method using maximum likelihood estimation is introduced to evaluate perceptual distance models with 2AFC data, overcoming limitations of existing approaches.

DetailsMotivation: The 2AFC paradigm has advantages over MOS but makes direct model evaluation difficult when comparisons lack shared images, and existing neural network approaches have conceptual limitations.

Method: Maximum likelihood estimation applied to a binomial decision model for a pure probabilistic approach to evaluate distance models using 2AFC psychophysical data.

Result: The method demonstrates superior simplicity, interpretability, flexibility, and computational efficiency compared to existing approaches.

Conclusion: The probabilistic maximum likelihood approach provides a robust and efficient alternative for evaluating perceptual distance models with 2AFC data.

Abstract: The Two Alternative Forced Choice (2AFC) paradigm offers advantages over the Mean Opinion Score (MOS) paradigm in psychophysics (PF), such as simplicity and robustness. However, when evaluating perceptual distance models, MOS enables direct correlation between model predictions and PF data. In contrast, 2AFC only allows pairwise comparisons to be converted into a quality ranking similar to MOS when comparisons include shared images. In large datasets, like BAPPS, where image patches and distortions are combined randomly, deriving rankings from 2AFC PF data becomes infeasible, as distorted images included in each comparisons are independent. To address this, instead of relying on MOS correlation, researchers have trained ad-hoc neural networks to reproduce 2AFC PF data based on pairs of model distances - a black-box approach with conceptual and operational limitations. This paper introduces a more robust distance-model evaluation method using a pure probabilistic approach, applying maximum likelihood estimation to a binomial decision model. Our method demonstrates superior simplicity, interpretability, flexibility, and computational efficiency, as shown through evaluations of various visual distance models on two 2AFC PF datasets.

[397] Reconstructing Topology-Consistent Face Mesh by Volume Rendering from Multi-View Images

Yating Wang, Ran Yi, Xiaoning Lei, Ke Fan, Jinkun Hao, Lizhuang Ma

Main category: cs.CV

TL;DR: A method combining explicit mesh with neural volume rendering to optimize geometry of template face meshes from multi-view images while maintaining topology consistency.

DetailsMotivation: High-quality 3D face reconstruction typically requires manual processing or specific capture settings, while NeRF shows advantages in reconstruction but lacks topology control.

Method: Derives density fields from meshes using distance fields, encodes radiance field in compact tri-planes, and introduces mesh-tailored adaptations to volume rendering for better convergence.

Result: Achieves superior reconstruction quality compared to previous approaches.

Conclusion: Validates the feasibility of integrating mesh and neural volume rendering for industrial 3D face asset creation.

Abstract: Industrial 3D face assets creation typically reconstructs topology-consistent face meshes from multi-view images for downstream production. However, high-quality reconstruction usually requires manual processing or specific capture settings. Recently NeRF has shown great advantages in 3D reconstruction, by representing scenes as density and radiance fields and utilizing neural volume rendering for novel view synthesis. Inspired by this, we introduce a novel method which combines explicit mesh with neural volume rendering to optimize geometry of an artist-made template face mesh from multi-view images while keeping the topology unchanged. Our method derives density fields from meshes using distance fields as an intermediary and encodes radiance field in compact tri-planes. To improve convergence, several adaptions tailored for meshes are introduced to the volume rendering. Experiments demonstrate that our method achieves superior reconstruction quality compared to previous approaches, validating the feasibility of integrating mesh and neural volume rendering.

[398] Capsule Network Projectors are Equivariant and Invariant Learners

Miles Everett, Aiden Durrant, Mingjun Zhong, Georgios Leontidis

Main category: cs.CV

TL;DR: CapsIE is a self-supervised architecture using Capsule Networks that learns both invariant and equivariant representations, achieving state-of-the-art performance on equivariant rotation tasks with higher efficiency and fewer parameters.

DetailsMotivation: Traditional self-supervised learning focuses on invariant representations, but recent work shows the importance of preserving equivariant properties. However, existing equivariant methods use highly prescribed architectures, limiting their flexibility.

Method: Proposes CapsIE architecture using Capsule Networks to capture equivariance with respect to novel viewpoints. Introduces a new objective function based on entropy minimization to accommodate CapsNet architectural changes.

Result: Achieves state-of-the-art performance on equivariant rotation tasks in 3DIEBench dataset compared to prior equivariant SSL methods. Performs competitively against supervised counterparts with higher efficiency and fewer network parameters.

Conclusion: CapsNets can learn complex and generalized representations for large-scale, multi-task datasets, demonstrating their effectiveness in equivariant self-supervised learning compared to previous benchmarks.

Abstract: Learning invariant representations has been the long-standing approach to self-supervised learning. However, recently progress has been made in preserving equivariant properties in representations, yet do so with highly prescribed architectures. In this work, we propose an invariant-equivariant self-supervised architecture that employs Capsule Networks (CapsNets), which have been shown to capture equivariance with respect to novel viewpoints. We demonstrate that the use of CapsNets in equivariant self-supervised architectures achieves improved downstream performance on equivariant tasks with higher efficiency and fewer network parameters. To accommodate the architectural changes of CapsNets, we introduce a new objective function based on entropy minimisation. This approach, which we name CapsIE (Capsule Invariant Equivariant Network), achieves state-of-the-art performance on the equivariant rotation tasks on the 3DIEBench dataset compared to prior equivariant SSL methods, while performing competitively against supervised counterparts. Our results demonstrate the ability of CapsNets to learn complex and generalised representations for large-scale, multi-task datasets compared to previous CapsNet benchmarks. Code is available at https://github.com/AberdeenML/CapsIE.

[399] Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

Young Kyun Jang, Ser-nam Lim

Main category: cs.CV

TL;DR: This paper introduces Cross-modal Backward-compatible Training (XBT) to enable backward compatibility between Vision-Language Pretraining models for cross-modal retrieval, eliminating the need for costly backfilling when upgrading models.

DetailsMotivation: Modern retrieval systems face costly backfilling requirements when upgrading to new models due to embedding incompatibility between old and new models, particularly in cross-modal retrieval scenarios.

Method: Proposes a projection module that maps new model embeddings to old model embeddings, pretrained with text data only, and uses parameter-efficient training strategies to preserve the new model’s knowledge without modifications.

Result: Experimental results on cross-modal retrieval datasets demonstrate XBT’s effectiveness in achieving backward compatibility between VLP models like CLIP.

Conclusion: XBT enables backfill-free upgrades when new VLP models emerge, providing an efficient solution for cross-modal retrieval system upgrades.

Abstract: Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model’s embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model’s embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model’s knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.

[400] How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

Chirui Chang, Jiahui Liu, Zhengzhe Liu, Xiaoyang Lyu, Yi-Hua Huang, Xin Tao, Pengfei Wan, Di Zhang, Xiaojuan Qi

Main category: cs.CV

TL;DR: L3DE is a method to objectively evaluate 3D visual coherence in AI-generated videos without manual annotations, using a 3D CNN trained on monocular 3D cues to distinguish real from synthetic content.

DetailsMotivation: Current video diffusion models generate photorealistic videos with good 3D consistency, but there's no systematic way to quantify how well they simulate the 3D visual world without relying on manual defect labeling.

Method: Uses a 3D convolutional network trained on monocular 3D cues (motion, depth, appearance) to distinguish real from synthetic videos, avoiding unreliable 3D reconstruction. Provides confidence scores and gradient-based visualizations.

Result: L3DE shows strong alignment with 3D reconstruction quality and human judgments. Evaluations reveal persistent simulation gaps and subtle inconsistencies in leading models like Kling, Sora, and MiniMax.

Conclusion: L3DE provides an objective, interpretable framework for assessing 3D visual coherence in generated videos, with applications in model benchmarking, deepfake detection, and video synthesis improvement.

Abstract: Recent advancements in video diffusion models enable the generation of photorealistic videos with impressive 3D consistency and temporal coherence. However, the extent to which these AI-generated videos simulate the 3D visual world remains underexplored. In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos’ ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. Instead of relying on 3D reconstruction, which is prone to failure with in-the-wild videos, L3DE employs a 3D convolutional network, trained on monocular 3D cues of motion, depth, and appearance, to distinguish real from synthetic videos. Confidence scores from L3DE quantify the gap between real and synthetic videos in terms of 3D visual coherence, while a gradient-based visualization pinpoints unrealistic regions, improving interpretability. We validate L3DE through extensive experiments, demonstrating strong alignment with 3D reconstruction quality and human judgments. Our evaluations on leading generative models (e.g., Kling, Sora, and MiniMax) reveal persistent simulation gaps and subtle inconsistencies. Beyond generative video assessment, L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies. Project page: https://justin-crchang.github.io/l3de-project-page/

[401] PanDORA: Casual HDR Radiance Acquisition for Indoor Scenes

Mohammad Reza Karimi Dastjerdi, Dominique Tanguay-Gaudreau, Frédéric Fortier-Chouinard, Yannick Hold-Geoffroy, Claude Demers, Nima Kalantari, Jean-François Lalonde

Main category: cs.CV

TL;DR: PanDORA is a system that uses two 360° cameras on a monopod to capture HDR indoor environments by simultaneously recording standard and fast-exposure panoramic videos, processed through a two-stage NeRF algorithm for high-quality HDR radiance mapping.

DetailsMotivation: Most view synthesis methods like NeRF fail to capture true HDR radiance due to reliance on LDR images from conventional cameras, and exposure bracketing techniques are too time-consuming for practical acquisition.

Method: Uses two 360° cameras on a portable monopod to simultaneously record standard exposure and fast shutter speed panoramic videos, processed by a two-stage NeRF-based algorithm with fine alignment of fast- and well-exposed frames.

Result: PanDORA generates non-saturated HDR radiance maps and achieves superior visual fidelity compared to existing methods on real indoor scenes with HDR ground truth lighting.

Conclusion: The system provides a scalable solution for capturing real environments in HDR with casual, high-quality acquisition.

Abstract: Most novel view synthesis methods-including Neural Radiance Fields (NeRF)-struggle to capture the true high dynamic range (HDR) radiance of scenes. This is primarily due to their dependence on low dynamic range (LDR) images from conventional cameras. Exposure bracketing techniques aim to address this challenge, but they introduce a considerable time burden during the acquisition process. In this work, we introduce PanDORA: PANoramic Dual-Observer Radiance Acquisition, a system designed for the casual, high quality HDR capture of indoor environments. Our approach uses two 360{\deg} cameras mounted on a portable monopod to simultaneously record two panoramic 360{\deg} videos: one with standard exposure and another at fast shutter speed. The resulting video data is processed by a proposed two-stage NeRF-based algorithm, including an algorithm for the fine alignment of the fast- and well-exposed frames, generating non-saturated HDR radiance maps. Compared to existing methods on a novel dataset of real indoor scenes captured with our apparatus and including HDR ground truth lighting, PanDORA achieves superior visual fidelity and provides a scalable solution for capturing real environments in HDR.

[402] A Survey of Defenses Against AI-Generated Visual Media: Detection,Disruption, and Authentication

Jingyi Deng, Chenhao Lin, Zhengyu Zhao, Shuai Liu, Zhe Peng, Qian Wang, Chao Shen

Main category: cs.CV

TL;DR: A systematic review of defense methods against AI-generated visual media, covering detection, disruption, and authentication strategies within a unified passive and proactive framework.

DetailsMotivation: Deep generative models have impressive capabilities but can be misused for malicious purposes like misinformation, deception, and copyright violation, necessitating effective defense mechanisms.

Method: The paper provides a comprehensive review organizing defense strategies into detection, disruption, and authentication categories, proposing a multidimensional taxonomy and analyzing evaluation datasets, criteria, and metrics.

Result: The review establishes a unified framework for understanding defense mechanisms against AI-generated visual media and identifies mainstream defense-related tasks and their characteristics.

Conclusion: The analysis reveals current research challenges and suggests potential future directions for improving defenses against malicious use of AI-generated visual content.

Abstract: Deep generative models have demonstrated impressive performance in various computer vision applications, including image synthesis, video generation, and medical analysis. Despite their significant advancements, these models may be used for malicious purposes, such as misinformation, deception, and copyright violation. In this paper, we provide a systematic and timely review of research efforts on defenses against AI-generated visual media, covering detection, disruption, and authentication. We review existing methods and summarize the mainstream defense-related tasks within a unified passive and proactive framework. Moreover, we survey the derivative tasks concerning the trustworthiness of defenses, such as their robustness and fairness. For each defense strategy, we formulate its general pipeline and propose a multidimensional taxonomy applicable across defense tasks, based on methodological strategies. Additionally, we summarize the commonly used evaluation datasets, criteria, and metrics. Finally, by analyzing the reviewed studies, we provide insights into current research challenges and suggest possible directions for future research.

[403] ShapeICP: Iterative Category-level Object Pose and Shape Estimation from Depth

Yihao Zhang, Harpreet S. Sawhney, John J. Leonard

Main category: cs.CV

TL;DR: ShapeICP is a novel iterative method for category-level object pose and shape estimation from single depth images that doesn’t require pose-annotated training data, using a mesh-based active shape model and outperforming many data-driven approaches.

DetailsMotivation: Category-level object pose and shape estimation is challenging due to compounded unknowns (pose, shape, correspondences) from single depth images. Prior data-driven methods risk generalization failures and use limited shape representations like point clouds and SDFs.

Method: Uses iterative estimation based on ICP algorithm with a novel mesh-based active shape model that maintains vertex connectivity. Does not require learning from pose-annotated data.

Result: ShapeICP surpasses many data-driven approaches that rely on pose data for training, despite not using pose-annotated data.

Conclusion: Opens up a new solution space for category-level pose and shape estimation by demonstrating that non-data-driven iterative methods can outperform data-driven approaches.

Abstract: Category-level object pose and shape estimation from a single depth image has recently drawn research attention due to its potential utility for tasks such as robotics manipulation. The task is particularly challenging because the three unknowns, object pose, object shape, and model-to-measurement correspondences, are compounded together, but only a single view of depth measurements is provided. Most of the prior work heavily relies on data-driven approaches to obtain solutions to at least one of the unknowns, and typically two, risking generalization failures if not designed and trained carefully. The shape representations used in the prior work also mainly focus on point clouds and signed distance fields (SDFs). In stark contrast to the prior work, we approach the problem using an iterative estimation method that does not require learning from pose-annotated data. Moreover, we construct and adopt a novel mesh-based object active shape model (ASM), which additionally maintains vertex connectivity compared to the commonly used point-based object ASM. Our algorithm, ShapeICP, is based on the iterative closest point (ICP) algorithm but is equipped with additional features for the category-level pose and shape estimation task. Although not using pose-annotated data, ShapeICP surpasses many data-driven approaches that rely on pose data for training, opening up a new solution space for researchers to consider.

[404] Law of Vision Representation in MLLMs

Shijia Yang, Bohan Zhai, Quanzeng You, Jianbo Yuan, Hongxia Yang, Chenfeng Xu

Main category: cs.CV

TL;DR: The paper presents the ‘Law of Vision Representation’ showing strong correlation between cross-modal alignment, correspondence in vision representation, and MLLM performance, quantified by AC score.

DetailsMotivation: To understand the relationship between vision representation factors and MLLM performance, enabling more efficient model training.

Method: Quantify cross-modal alignment and correspondence using AC score, conduct experiments with 13 vision representation settings across 8 benchmarks.

Result: AC score shows linear correlation with model performance, enabling identification of optimal vision representation without language model finetuning.

Conclusion: The proposed approach achieves 99.7% reduction in computational cost while maintaining performance through optimal vision representation selection.

Abstract: We present the “Law of Vision Representation” in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.

[405] SteerDiff: Steering towards Safe Text-to-Image Diffusion Models

Hongxiang Zhang, Yifeng He, Hao Chen

Main category: cs.CV

TL;DR: SteerDiff is a lightweight adaptor module that prevents inappropriate content generation in text-to-image diffusion models by manipulating text embeddings to steer outputs toward ethical standards.

DetailsMotivation: Existing safety measures for T2I diffusion models are insufficient - text classifiers can be bypassed and fine-tuning safeguards becomes challenging as models scale. Recent red-teaming attacks highlight the need for better content prevention.

Method: SteerDiff acts as an intermediary between user input and diffusion model, identifying and manipulating inappropriate concepts within the text embedding space to guide the model away from harmful outputs.

Result: Extensive experiments across concept unlearning tasks show SteerDiff effectively prevents inappropriate content generation. It demonstrates robustness against multiple red-teaming strategies and versatility in concept forgetting tasks.

Conclusion: SteerDiff provides an effective and flexible safety mechanism for T2I diffusion models that maintains usability while preventing harmful content generation through text embedding manipulation.

Abstract: Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming attack researches further underscore the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.

[406] Fast constrained sampling in pre-trained diffusion models

Alexandros Graikos, Nebojsa Jojic, Dimitris Samaras

Main category: cs.CV

TL;DR: A fast constrained sampling algorithm for diffusion models that avoids expensive backpropagation while maintaining high-quality generation under various constraints like inpainting and style-guided generation.

DetailsMotivation: Large pre-trained diffusion models have general image knowledge but are inefficient and unreliable for constrained sampling tasks. Existing methods are either too slow (using backpropagation) or fail to capture long-range correlations.

Method: Uses an approximation to Newton’s optimization method in denoising diffusion models to speed up inference and avoid expensive backpropagation operations.

Result: Produces results that rival or surpass state-of-the-art training-free inference methods while requiring significantly less time. Effective for both linear (inpainting, super-resolution) and non-linear (style-guided generation) constraints.

Conclusion: The proposed algorithm enables fast, high-quality constrained sampling in diffusion models, making it practical for various image generation tasks with constraints.

Abstract: Large denoising diffusion models, such as Stable Diffusion, have been trained on billions of image-caption pairs to perform text-conditioned image generation. As a byproduct of this training, these models have acquired general knowledge about image statistics, which can be useful for other inference tasks. However, when confronted with sampling an image under new constraints, e.g. generating the missing parts of an image, using large pre-trained text-to-image diffusion models is inefficient and often unreliable. Previous approaches either utilized backpropagation through the denoiser network, making them significantly slower and more memory-demanding than simple text-to-image generation, or only enforced the constraint locally, failing to capture critical long-range correlations in the sampled image. In this work, we propose an algorithm that enables fast, high-quality generation under arbitrary constraints. We show that in denoising diffusion models, we can employ an approximation to Newton’s optimization method that allows us to speed up inference and avoid the expensive backpropagation operations. Our approach produces results that rival or surpass the state-of-the-art training-free inference methods while requiring a fraction of the time. We demonstrate the effectiveness of our algorithm under both linear (inpainting, super-resolution) and non-linear (style-guided generation) constraints. An implementation is provided at https://github.com/cvlab-stonybrook/fast-constrained-sampling.

[407] Probabilistic Language-Image Pre-Training

Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun

Main category: cs.CV

TL;DR: ProLIP is the first probabilistic vision-language model pre-trained on billion-scale data using only probabilistic objectives, achieving strong zero-shot performance and uncertainty estimation without extra parameters.

DetailsMotivation: Current VLMs use deterministic embeddings that assume one-to-one image-text correspondence, but real-world relationships are many-to-many (multiple captions per image and vice versa).

Method: Uses probabilistic objectives with an “uncertainty token” for efficient uncertainty estimation, and introduces a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original/masked inputs.

Result: Achieves 74.6% ImageNet zero-shot accuracy with ViT-B/16, and uncertainty estimates align with intuitive notions (shorter texts more uncertain, general inputs include specific ones). Text uncertainties further improve ImageNet accuracy to 75.8% in few-shot setting.

Conclusion: ProLIP demonstrates practical advantages of probabilistic approaches for VLMs, enabling better uncertainty estimation and improved downstream task performance.

Abstract: Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an “uncertainty token” without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip

[408] CoralSCOP-LAT: Labeling and Analyzing Tool for Coral Reef Images with Dense Mask

Yuk-Kwan Wong, Ziqiang Zheng, Mingzhe Zhang, David Suggett, Sai-Kit Yeung

Main category: cs.CV

TL;DR: CoralSCOP-LAT is an automated coral reef image analysis tool that uses machine learning to segment and analyze coral regions, improving labeling efficiency and precision compared to existing methods.

DetailsMotivation: Current semi-automated coral reef image analysis platforms face fundamental limitations in handling rapidly expanding image datasets for ecosystem monitoring.

Method: Leverages advanced machine learning models specifically tailored for coral reef segmentation to automatically generate dense segmentation masks with minimal manual effort.

Result: Extensive evaluations show CoralSCOP-LAT surpasses existing tools in time efficiency, accuracy, precision, and flexibility, significantly accelerating the coral reef annotation process.

Conclusion: CoralSCOP-LAT provides an efficient solution for obtaining high-quality coral reef segmentation and analysis outcomes, addressing key challenges in coral reef monitoring.

Abstract: Coral reef imagery offers critical data for monitoring ecosystem health, in particular as the ease of image datasets continues to rapidly expand. Whilst semi-automated analytical platforms for reef imagery are becoming more available, the dominant approaches face fundamental limitations. To address these challenges, we propose CoralSCOP-LAT, a coral reef image analysis and labeling tool that automatically segments and analyzes coral regions. By leveraging advanced machine learning models tailored for coral reef segmentation, CoralSCOP-LAT enables users to generate dense segmentation masks with minimal manual effort, significantly enhancing both the labeling efficiency and precision of coral reef analysis. Our extensive evaluations demonstrate that CoralSCOP-LAT surpasses existing coral reef analysis tools in terms of time efficiency, accuracy, precision, and flexibility. CoralSCOP-LAT, therefore, not only accelerates the coral reef annotation process but also assists users in obtaining high-quality coral reef segmentation and analysis outcomes. Github Page: https://github.com/ykwongaq/CoralSCOP-LAT.

[409] SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Shivanshu Shekhar, Shreyas Singh, Tong Zhang

Main category: cs.CV

TL;DR: The paper introduces a self-entropy regularization method to stabilize DPO training for diffusion models, addressing overfitting and reward hacking issues.

DetailsMotivation: DPO-based methods for aligning diffusion models with human preferences suffer from overfitting and reward hacking during prolonged training, especially when models are optimized on out-of-distribution data.

Method: Proposes a self-entropy regularization mechanism integrated with reinforcement learning from human feedback to encourage broader exploration and improve training robustness.

Result: The regularization technique effectively mitigates reward hacking, improves training stability, enhances image quality across latent space, and boosts image diversity and specificity.

Conclusion: Integrating human feedback with self-entropy regularization achieves state-of-the-art results on key image generation metrics by stabilizing DPO training for diffusion models.

Abstract: Direct Preference Optimization (DPO) has been successfully used to align large language models (LLMs) according to human preferences, and more recently it has also been applied to improving the quality of text-to-image diffusion models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are highly susceptible to overfitting and reward hacking, especially when the generative model is optimized to fit out-of-distribution during prolonged training. To overcome these challenges and stabilize the training of diffusion models, we introduce a self-entropy regularization mechanism in reinforcement learning from human feedback. This enhancement improves DPO training by encouraging broader exploration and greater robustness. Our regularization technique effectively mitigates reward hacking, leading to improved stability and enhanced image quality across the latent space. Extensive experiments demonstrate that integrating human feedback with self-entropy regularization can significantly boost image diversity and specificity, achieving state-of-the-art results on key image generation metrics.

[410] TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction

DaDong Jiang, Zhihui Ke, Xiaobo Zhou, Zhi Hou, Xianghui Yang, Wenbo Hu, Tie Qiu, Chunchao Guo

Main category: cs.CV

TL;DR: TimeFormer is a plug-and-play module that enhances dynamic scene reconstruction by learning temporal relationships between 3D Gaussians, enabling better handling of complex motions while maintaining original rendering speed during inference.

DetailsMotivation: Existing dynamic scene reconstruction methods learn motion changes independently from individual timestamps, making them struggle with complex scenes involving violent movement, extreme geometries, or reflective surfaces.

Method: TimeFormer uses a Cross-Temporal Transformer Encoder to learn temporal relationships of deformable 3D Gaussians, combined with a two-stream optimization strategy that transfers motion knowledge during training but removes TimeFormer during inference.

Result: Extensive experiments in multi-view and monocular dynamic scenes show qualitative and quantitative improvements in reconstruction quality.

Conclusion: TimeFormer effectively addresses limitations in current dynamic scene reconstruction by learning implicit motion patterns while preserving rendering efficiency during inference.

Abstract: Dynamic scene reconstruction is a long-term challenge in 3D vision. Recent methods extend 3D Gaussian Splatting to dynamic scenes via additional deformation fields and apply explicit constraints like motion flow to guide the deformation. However, they learn motion changes from individual timestamps independently, making it challenging to reconstruct complex scenes, particularly when dealing with violent movement, extreme-shaped geometries, or reflective surfaces. To address the above issue, we design a plug-and-play module called TimeFormer to enable existing deformable 3D Gaussians reconstruction methods with the ability to implicitly model motion patterns from a learning perspective. Specifically, TimeFormer includes a Cross-Temporal Transformer Encoder, which adaptively learns the temporal relationships of deformable 3D Gaussians. Furthermore, we propose a two-stream optimization strategy that transfers the motion knowledge learned from TimeFormer to the base stream during the training phase. This allows us to remove TimeFormer during inference, thereby preserving the original rendering speed. Extensive experiments in the multi-view and monocular dynamic scenes validate qualitative and quantitative improvement brought by TimeFormer. Project Page: https://patrickddj.github.io/TimeFormer/

[411] Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment

Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fenglong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, Yulun Zhang

Main category: cs.CV

TL;DR: The paper introduces grounding-IQA, a new paradigm that integrates multimodal referring and grounding with image quality assessment to enable more fine-grained quality perception through location-aware descriptions and visual question answering.

DetailsMotivation: Existing MLLM-based IQA methods rely on general contextual descriptions, which limits fine-grained quality assessment capabilities. The authors aim to address this limitation by incorporating precise location information into quality evaluation.

Method: Proposed grounding-IQA paradigm with two subtasks: GIQA-DES (detailed descriptions with bounding boxes) and GIQA-VQA (quality QA for local regions). Created GIQA-160K dataset through automated annotation pipeline and developed GIQA-Bench benchmark for evaluation.

Result: Experiments demonstrate that the proposed method facilitates more fine-grained IQA applications by enabling detailed quality assessments with precise location information.

Conclusion: Grounding-IQA extends existing IQA capabilities by integrating multimodal referring and grounding, providing a more comprehensive framework for fine-grained image quality assessment through location-aware descriptions and region-specific quality evaluation.

Abstract: The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception, thereby extending existing IQA. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a well-designed benchmark, GIQA-Bench. The benchmark evaluates the grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments demonstrate that our proposed method facilitates the more fine-grained IQA application. Code: https://github.com/zhengchen1999/Grounding-IQA.

[412] Autonomous Imagination: Closed-Loop Decomposition of Visual-to-Textual Conversion in Visual Reasoning for Multimodal Large Language Models

Jingming Liu, Yumeng Li, Boyuan Xiao, Yichang Jian, Ziang Qin, Tianjia Shao, Yao-Xiang Ding, Kun Zhou

Main category: cs.CV

TL;DR: MLLMs struggle with visual tasks like counting and puzzles due to perceptual bottlenecks. The paper proposes ‘autonomous imagination’ - iteratively modifying visual inputs through closed-loop visual modification steps to decompose visual-to-textual conversion.

DetailsMotivation: Multimodal LLMs fail at simple visual tasks despite LLMs' success in textual reasoning. The issue is visual-to-textual conversion bottlenecks that can't be solved by scaling reasoning alone.

Method: Autonomous imagination approach where MLLMs iteratively modify visual inputs (isolating objects, rearranging puzzle pieces) into intermediate visual states, decomposing visual-to-textual conversion into closed-loop visual modification steps.

Result: Without retraining, MLLMs can now solve tasks initially beyond their perceptual capability, demonstrating that closed-loop visual modification effectively decomposes visual reasoning into solvable substeps.

Conclusion: Closed-loop visual modification through autonomous imagination is an effective approach for decomposing complex visual reasoning tasks that MLLMs previously couldn’t handle due to perceptual limitations.

Abstract: Under pure textual modality, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning tasks by decomposing them into simpler sub-problems. However, Multimodal Large Language Models (MLLMs) still struggle with some seemingly straightforward visual tasks, such as counting and solving jigsaw puzzles. We argue that these tasks challenge the ability of visual-to-textual conversion, where MLLMs convert visual information perceived from the input scene, to textual information for further reasoning and generating the answer. If the complexity of the visual input is beyond the perceptual capability of the MLLMs, without decomposing this conversion process, simply scaling inference-time reasoning cannot solve the task because it repeatedly encounters the same perceptual bottleneck. We propose an approach, autonomous imagination, to enable MLLMs to iteratively modify visual inputs (e.g. isolating objects, rearranging puzzle pieces) into intermediate visual states, decomposing visual-to-textual conversion into closed-loop visual modification steps. We show that, without any retraining, MLLMs can now solve tasks initially beyond their perceptual capability, highlighting that closed-loop visual modification can be an effective way of decomposing the visual reasoning task into solvable substeps. Our code and data are released at https://future-item.github.io/autoimagine-site/.

[413] RowDetr: End-to-End Crop Row Detection Using Polynomials

Rahul Harsha Cheppally, Ajay Sharda

Main category: cs.CV

TL;DR: RowDetr is an efficient transformer-based neural network for crop row detection that uses polynomial representation to directly parameterize crop rows, eliminating post-processing and achieving real-time performance on edge devices.

DetailsMotivation: Vision-based crop row detection struggles with gaps, curved rows, and occlusions in under-canopy environments, while accurate labeling is difficult due to these challenges.

Method: Uses lightweight backbone with hybrid encoder, PolySampler module, multi-scale deformable attention, and PolyOptLoss energy-based loss function for geometric alignment optimization.

Result: Achieved F1 score up to 0.74, lane position deviation as low as 0.405, and real-time inference latency of 6.7ms (optimized to 3.5ms with INT8 quantization).

Conclusion: RowDetr demonstrates the efficiency of polynomial parameterization for crop row detection, making it suitable for deployment on edge computing devices in agricultural robotics.

Abstract: Crop row detection enables autonomous robots to navigate in gps denied environments. Vision based strategies often struggle in the environments due to gaps, curved crop rows and require post-processing steps. Furthermore, labeling crop rows in under the canopy environments accurately is very difficult due to occlusions. This study introduces RowDetr, an efficient end-to-end transformer-based neural network for crop row detection in precision agriculture. RowDetr leverages a lightweight backbone and a hybrid encoder to model straight, curved, or occluded crop rows with high precision. Central to the architecture is a novel polynomial representation that enables direct parameterization of crop rows, eliminating computationally expensive post-processing. Key innovations include a PolySampler module and multi-scale deformable attention, which work together with PolyOptLoss, an energy-based loss function designed to optimize geometric alignment between predicted and the annotated crop rows, while also enhancing robustness against labeling noise. RowDetr was evaluated against other state-of-the-art end-to-end crop row detection methods like AgroNav and RolColAttention on a diverse dataset of 6,962 high-resolution images, used for training, validation, and testing across multiple crop types with annotated crop rows. The system demonstrated superior performance, achieved an F1 score up to 0.74 and a lane position deviation as low as 0.405. Furthermore, RowDetr achieves a real-time inference latency of 6.7ms, which was optimized to 3.5ms with INT8 quantization on an NVIDIA Jetson Orin AGX. This work highlighted the critical efficiency of polynomial parameterization, making RowDetr particularly suitable for deployment on edge computing devices in agricultural robotics and autonomous farming equipment. Index terms > Crop Row Detection, Under Canopy Navigation, Transformers, RT-DETR, RT-DETRv2

[414] DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

Jinxiu Liu, Shaoheng Lin, Yinxiao Li, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: DynamicScaler is a method for generating high-quality panoramic video content using a diffusion model with fixed resolution, addressing limitations of existing video diffusion models in scene-level dynamic content synthesis.

DetailsMotivation: The increasing demand for immersive AR/VR applications and spatial intelligence requires high-quality scene-level and 360° panoramic video generation, but current video diffusion models are constrained by limited resolution and aspect ratio.

Method: Proposes Offset Shifting Denoiser with seamless rotating Window for efficient synchronous denoising, and Global Motion Guidance mechanism to ensure local detail fidelity and global motion continuity.

Result: Achieves superior content and motion quality in panoramic scene-level video generation with constant VRAM consumption regardless of output video resolution.

Conclusion: DynamicScaler provides a training-free, efficient, and scalable solution for immersive dynamic scene creation that preserves coherence across panoramic scenes of arbitrary size.

Abstract: The increasing demand for immersive AR/VR applications and spatial intelligence has heightened the need to generate high-quality scene-level and 360${\deg}$ panoramic video. However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose $\textbf{DynamicScaler}$, addressing these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. Specifically, we introduce a Offset Shifting Denoiser, facilitating efficient, synchronous, and coherent denoising panoramic dynamic scenes via a diffusion model with fixed resolution through a seamless rotating Window, which ensures seamless boundary transitions and consistency across the entire panoramic space, accommodating varying resolutions and aspect ratios. Additionally, we employ a Global Motion Guidance mechanism to ensure both local detail fidelity and global motion continuity. Extensive experiments demonstrate our method achieves superior content and motion quality in panoramic scene-level video generation, offering a training-free, efficient, and scalable solution for immersive dynamic scene creation with constant VRAM consumption regardless of the output video resolution. Project page is available at $\href{https://dynamic-scaler.pages.dev/new}{https://dynamic-scaler.pages.dev/new}$.

[415] Leveraging Confident Image Regions for Source-Free Domain-Adaptive Object Detection

Mohamed Lamine Mekhalfi, Davide Boscaini, Fabio Poiesi

Main category: cs.CV

TL;DR: A novel data augmentation method for source-free domain-adaptive object detection that cuts confident target regions, augments them with pseudo-labels, and combines them into challenging images, implemented via teacher-student learning.

DetailsMotivation: Address the lack of data augmentation schemes for source-free domain-adaptive object detection, where source data is unavailable during adaptation.

Method: Cut target image regions where detector is confident, augment them with pseudo-labels, join into challenging target images, and use teacher-student learning to prevent model collapse.

Result: Achieved state-of-the-art performance on two out of three traffic scene adaptation benchmarks.

Conclusion: The proposed data augmentation approach effectively adapts object detectors to target domains without source data access, demonstrating strong performance on traffic scene benchmarks.

Abstract: Source-free domain-adaptive object detection is an interesting but scarcely addressed topic. It aims at adapting a source-pretrained detector to a distinct target domain without resorting to source data during adaptation. So far, there is no data augmentation scheme tailored to source-free domain-adaptive object detection. To this end, this paper presents a novel data augmentation approach that cuts out target image regions where the detector is confident, augments them along with their respective pseudo-labels, and joins them into a challenging target image to adapt the detector. As the source data is out of reach during adaptation, we implement our approach within a teacher-student learning paradigm to ensure that the model does not collapse during the adaptation procedure. We evaluated our approach on three adaptation benchmarks of traffic scenes, scoring new state-of-the-art on two of them.

[416] Fast-RF-Shimming: Accelerate RF Shimming in 7T MRI using Deep Learning

Zhengyi Lu, Hao Liang, Ming Lu, Xiao Wang, Xinqiang Yan, Yuankai Huo

Main category: cs.CV

TL;DR: Fast-RF-Shimming is a machine learning framework that achieves 5000x speed-up over traditional RF shimming methods for addressing B1+ field inhomogeneities in ultrahigh field MRI, using ResNet-based mapping with confidence parameter integration and optional post-processing.

DetailsMotivation: Ultrahigh field MRI provides superior SNR and resolution but suffers from B1+ field inhomogeneities causing image artifacts. Traditional RF shimming methods like MLS are effective but time-consuming, while existing ML approaches face training time, complexity, and data requirement challenges.

Method: Three-phase approach: 1) Use Adam optimization to derive reference shimming weights from multi-channel B1+ fields, 2) Train ResNet to map B1+ fields directly to RF shimming outputs with confidence parameter in loss function, 3) Optional NFD post-processing to identify extreme non-uniform outcomes.

Result: Achieves 5000x speed-up compared to traditional MLS optimization while maintaining or improving predictive accuracy. Comparative evaluations show significant gains in both processing speed and accuracy.

Conclusion: Fast-RF-Shimming provides a promising solution for persistent B1+ inhomogeneity challenges in ultrahigh field MRI, offering substantial speed improvements without compromising performance.

Abstract: Ultrahigh field (UHF) Magnetic Resonance Imaging (MRI) offers an elevated signal-to-noise ratio (SNR), enabling exceptionally high spatial resolution that benefits both clinical diagnostics and advanced research. However, the jump to higher fields introduces complications, particularly transmit radiofrequency (RF) field ($B_{1}^{+}$) inhomogeneities, manifesting as uneven flip angles and image intensity irregularities. These artifacts can degrade image quality and impede broader clinical adoption. Traditional RF shimming methods, such as Magnitude Least Squares (MLS) optimization, effectively mitigate $B_{1}^{+}$ inhomogeneity, but remain time-consuming. Recent machine learning approaches, including RF Shim Prediction by Iteratively Projected Ridge Regression and other deep learning architectures, suggest alternative pathways. Although these approaches show promise, challenges such as extensive training periods, limited network complexity, and practical data requirements persist. In this paper, we introduce a holistic learning-based framework called Fast-RF-Shimming, which achieves a 5000x speed-up compared to the traditional MLS method. In the initial phase, we employ random-initialized Adaptive Moment Estimation (Adam) to derive the desired reference shimming weights from multi-channel $B_{1}^{+}$ fields. Next, we train a Residual Network (ResNet) to map $B_{1}^{+}$ fields directly to the ultimate RF shimming outputs, incorporating the confidence parameter into its loss function. Finally, we design Non-uniformity Field Detector (NFD), an optional post-processing step, to ensure the extreme non-uniform outcomes are identified. Comparative evaluations with standard MLS optimization underscore notable gains in both processing speed and predictive accuracy, which indicates that our technique shows a promising solution for addressing persistent inhomogeneity challenges.

[417] CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification

Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves

Main category: cs.CV

TL;DR: CBVLM is a zero-shot medical image classification method that uses Large Vision-Language Models to predict human-interpretable concepts and make final diagnoses, achieving explainability without training and reducing annotation costs.

DetailsMotivation: Address the challenges of limited annotated data and lack of interpretability in medical deep learning by leveraging LVLMs' few-shot capabilities while maintaining explainability through concept-based reasoning.

Method: Two-stage approach: 1) Prompt LVLM to detect predefined concepts in images, 2) Use concept predictions to classify images, with retrieval-based in-context learning for both stages.

Result: Outperforms Concept Bottleneck Models and task-specific supervised methods across four medical datasets using twelve LVLMs, without training and with minimal annotations.

Conclusion: CBVLM provides an effective solution for interpretable medical image analysis by combining concept-based reasoning with LVLMs’ few-shot capabilities, eliminating training requirements and reducing annotation burden.

Abstract: The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the model output on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/.

[418] UniUIR: Considering Underwater Image Restoration as An All-in-One Learner

Xu Zhang, Huan Zhang, Guoli Wang, Qian Zhang, Lefei Zhang, Bo Du

Main category: cs.CV

TL;DR: UniUIR is a universal underwater image restoration method that handles complex mixed distortions through Mamba Mixture-of-Experts, spatial-frequency prior generator, and depth-aware processing.

DetailsMotivation: Existing underwater image restoration methods only handle color distortion or jointly address color and haze issues, but overlook more complex degradations in real-world underwater scenes.

Method: Proposes UniUIR with three key components: Mamba Mixture-of-Experts module to decouple degradation-specific issues, spatial-frequency prior generator to extract degradation priors in both domains, and depth information integration from pre-trained models for region-dependent distortion handling.

Result: Extensive experiments show UniUIR produces more attractive results in qualitative and quantitative comparisons, with strong generalization compared to state-of-the-art methods.

Conclusion: UniUIR effectively addresses complex underwater mixed distortions through its all-in-one approach, demonstrating superior performance and generalization capabilities.

Abstract: Existing underwater image restoration (UIR) methods generally only handle color distortion or jointly address color and haze issues, but they often overlook the more complex degradations that can occur in underwater scenes. To address this limitation, we propose a Universal Underwater Image Restoration method, termed as UniUIR, considering the complex scenario of real-world underwater mixed distortions as an all-in-one manner. To decouple degradation-specific issues and explore the inter-correlations among various degradations in UIR task, we designed the Mamba Mixture-of-Experts module. This module enables each expert to identify distinct types of degradation and collaboratively extract task-specific priors while maintaining global feature representation based on linear complexity. Building upon this foundation, to enhance degradation representation and address the task conflicts that arise when handling multiple types of degradation, we introduce the spatial-frequency prior generator. This module extracts degradation prior information in both spatial and frequency domains, and adaptively selects the most appropriate task-specific prompts based on image content, thereby improving the accuracy of image restoration. Finally, to more effectively address complex, region-dependent distortions in UIR task, we incorporate depth information derived from a large-scale pre-trained depth prediction model, thereby enabling the network to perceive and leverage depth variations across different image regions to handle localized degradation. Extensive experiments demonstrate that UniUIR can produce more attractive results across qualitative and quantitative comparisons, and shows strong generalization than state-of-the-art methods.

[419] WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction

Zilong Wang, Zhiyang Dou, Yuan Liu, Cheng Lin, Xiao Dong, Yunhui Guo, Chenxu Zhang, Xin Li, Wenping Wang, Xiaohu Guo

Main category: cs.CV

TL;DR: WonderHuman reconstructs dynamic human avatars from monocular videos using 2D diffusion model priors and dual-space optimization for high-fidelity novel view synthesis, especially for unseen body parts.

DetailsMotivation: Previous methods require full body coverage in input videos, but daily practice often provides only limited viewpoints like monocular front-view videos, making reconstruction of unseen parts challenging.

Method: Leverages 2D generative diffusion model priors with Dual-Space Optimization (applying Score Distillation Sampling in both canonical and observation spaces), View Selection strategy, and Pose Feature Injection to ensure visual consistency and realism.

Result: Achieves state-of-the-art performance in producing photorealistic renderings from monocular videos, particularly for challenging unseen body parts.

Conclusion: WonderHuman effectively solves the problem of reconstructing dynamic human avatars from limited viewpoint videos by combining diffusion priors with novel optimization techniques.

Abstract: In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Previous dynamic human avatar reconstruction methods typically require the input video to have full coverage of the observed human body. However, in daily practice, one typically has access to limited viewpoints, such as monocular front-view videos, making it a cumbersome task for previous methods to reconstruct the unseen parts of the human avatar. To tackle the issue, we present WonderHuman, which leverages 2D generative diffusion model priors to achieve high-quality, photorealistic reconstructions of dynamic human avatars from monocular videos, including accurate rendering of unseen body parts. Our approach introduces a Dual-Space Optimization technique, applying Score Distillation Sampling (SDS) in both canonical and observation spaces to ensure visual consistency and enhance realism in dynamic human reconstruction. Additionally, we present a View Selection strategy and Pose Feature Injection to enforce the consistency between SDS predictions and observed data, ensuring pose-dependent effects and higher fidelity in the reconstructed avatar. In the experiments, our method achieves SOTA performance in producing photorealistic renderings from the given monocular video, particularly for those challenging unseen parts. The project page and source code can be found at https://wyiguanw.github.io/WonderHuman/.

[420] Controllable Video Generation with Provable Disentanglement

Yifan Shen, Peiyuan Zhu, Zijian Li, Shaoan Xie, Namrata Deka, Zongfang Liu, Zeyu Tang, Guangyi Chen, Kun Zhang

Main category: cs.CV

TL;DR: CoVoGAN proposes a novel approach for controllable video generation by disentangling static and dynamic latent variables, enabling independent control over individual video concepts through theoretical identifiability analysis and a Temporal Transition Module.

DetailsMotivation: Existing video generation methods treat videos as a whole, neglecting fine-grained spatiotemporal relationships, which limits control precision and efficiency. The paper aims to address this by enabling disentangled control over individual video concepts.

Method: The method disentangles static and dynamic latent variables using the minimal change principle. It achieves component-wise identifiability of dynamic variables through sufficient change property, implements a Temporal Transition Module to disentangle latent dynamics, and minimizes latent dynamic variable dimensionality while imposing temporal conditional independence.

Result: Extensive experiments on various video generation benchmarks show that CoVoGAN significantly improves generation quality and controllability across diverse real-world scenarios, both qualitatively and quantitatively.

Conclusion: CoVoGAN successfully enables efficient and independent control over individual video concepts through latent variable disentanglement, providing a theoretically grounded and practical solution for controllable video generation that outperforms existing methods.

Abstract: Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling disentangled control of video generation. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.

[421] Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention

Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz

Main category: cs.CV

TL;DR: MedViTV2 introduces KAN layers into transformers for medical image classification, addressing real-world image corruptions with enhanced efficiency and accuracy.

DetailsMotivation: Existing methods like CNNs and transformers are designed for clean images, but real clinical data has corruptions from multi-center studies and equipment variations.

Method: Incorporates KAN layers into transformers, uses efficient KAN blocks, enhanced Dilated Neighborhood Attention (DiNA) for global context, and hierarchical hybrid strategy for local-global feature balance.

Result: Achieved SOTA in 27/29 experiments on 17 medical image datasets and 12 corrupted datasets, with 44% computational efficiency improvement and accuracy gains of 4.6-13.4% on benchmarks.

Conclusion: MedViTV2 effectively addresses medical image corruption challenges with improved efficiency and performance, demonstrating strong generalization across diverse medical imaging tasks.

Abstract: Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture incorporating Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We have developed an efficient KAN block to reduce computational load while enhancing the accuracy of the original MedViT. Additionally, to counteract the fragility of our MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel capable of capturing global context and expanding receptive fields to scale the model effectively and addressing feature collapse issues. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks in an efficient manner, which balances local and global feature perceptions to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets demonstrate that MedViTV2 achieved state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44% more computationally efficient than the previous version and significantly enhances accuracy, achieving improvements of 4.6% on MedMNIST, 5.8% on NonMNIST, and 13.4% on the MedMNIST-C benchmark.

[422] Generative Human Geometry Distribution

Xiangjun Tang, Biao Zhang, Peter Wonka

Main category: cs.CV

TL;DR: Proposes a new geometry distribution model for realistic human geometry generation using 2D feature maps and SMPL-based domain, achieving 57% improvement in geometry quality.

DetailsMotivation: Realistic human geometry generation is challenging due to the need to preserve fine clothing details and accurately model clothing-body interactions. Existing single-geometry distributions are inefficient for large-scale learning.

Method: Two key techniques: encoding distributions as 2D feature maps instead of network parameters, and using SMPL models as domain with refined flow velocity field. Two-stage training paradigm with diffusion flow model for compression and another flow model on latent space.

Result: Outperforms state-of-the-art methods with 57% improvement in geometry quality on pose-conditioned random avatar generation and avatar-consistent novel pose synthesis tasks.

Conclusion: The proposed geometry distribution model effectively addresses the challenges of realistic human geometry generation and demonstrates superior performance compared to existing methods.

Abstract: Realistic human geometry generation is an important yet challenging task, requiring both the preservation of fine clothing details and the accurate modeling of clothing-body interactions. To tackle this challenge, we build upon Geometry distributions, a recently proposed representation that can model a single human geometry with high fidelity using a flow matching model. However, extending a single-geometry distribution to a dataset is non-trivial and inefficient for large-scale learning. To address this, we propose a new geometry distribution model by two key techniques: (1) encoding distributions as 2D feature maps rather than network parameters, and (2) using SMPL models as the domain instead of Gaussian and refining the associated flow velocity field. We then design a generative framework adopting a two staged training paradigm analogous to state-of-the-art image and 3D generative models. In the first stage, we compress geometry distributions into a latent space using a diffusion flow model; the second stage trains another flow model on this latent space. We validate our approach on two key tasks: pose-conditioned random avatar generation and avatar-consistent novel pose synthesis. Experimental results demonstrate that our method outperforms existing state-of-the-art methods, achieving a 57% improvement in geometry quality.

[423] Novel Object 6D Pose Estimation with a Single Reference View

Jian Liu, Wei Sun, Kai Zeng, Jin Zheng, Hui Yang, Hossein Rahmani, Ajmal Mian, Lin Wang

Main category: cs.CV

TL;DR: SinRef-6D enables novel object 6D pose estimation using only a single reference view, achieving performance comparable to CAD-based methods without requiring CAD models or dense reference views.

DetailsMotivation: Existing methods rely on CAD models or dense reference views which are difficult to acquire. Single reference view approaches are more scalable but challenging due to large pose discrepancies and limited geometric information.

Method: Uses iterative point-wise alignment in a common coordinate system based on state space models (SSMs). RGB and Points SSMs capture long-range dependencies and spatial information from single views with linear complexity.

Result: Achieves on-par performance with CAD-based and dense reference view-based methods across six popular datasets and real-world robotic scenes, despite the more challenging single reference setting.

Conclusion: SinRef-6D provides a scalable solution for novel object 6D pose estimation that works with only a single reference view, eliminating the need for CAD models or dense reference views while maintaining competitive performance.

Abstract: Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in a common coordinate system based on state space models (SSMs). Specifically, iterative object-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.

[424] PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation

Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, Yiu-ming Cheung

Main category: cs.CV

TL;DR: PRO-VPT introduces adaptive distribution optimization for visual prompt tuning by iteratively relocating prompts between blocks based on task-specific needs, achieving state-of-the-art performance on VTAB-1k and FGVC benchmarks.

DetailsMotivation: Current VPT methods use fixed prompt distributions across different tasks, ignoring that the importance of each block varies depending on the task, which limits performance potential.

Method: Proposes PRO-VPT framework with iterative prompt relocation strategy: pruning idle prompts from saturated blocks and allocating them to prompt-needed blocks, formulated as nested optimization between ADO and VPT.

Result: Significantly outperforms advanced VPT methods, achieving 1.6 pp and 2.0 pp average accuracy improvements over VPT on VTAB-1k and FGVC benchmarks, leading to state-of-the-art performance.

Conclusion: Adaptive prompt distribution optimization through iterative relocation unlocks the full potential of VPT, demonstrating the importance of task-specific prompt allocation strategies.

Abstract: Visual prompt tuning (VPT), i.e., fine-tuning some lightweight prompt tokens, provides an efficient and effective approach for adapting pre-trained models to various downstream tasks. However, most prior art indiscriminately uses a fixed prompt distribution across different tasks, neglecting the importance of each block varying depending on the task. In this paper, we introduce adaptive distribution optimization (ADO) by tackling two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through empirical analysis, we first confirm that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution built upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy derived from this formulation, comprising two steps: pruning idle prompts from prompt-saturated blocks, followed by allocating these prompts to the most prompt-needed blocks. By iteratively performing prompt relocation and VPT, our proposal can adaptively learn the optimal prompt distribution in a nested optimization-based manner, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly outperforms advanced VPT methods, e.g., PRO-VPT surpasses VPT by 1.6 pp and 2.0 pp average accuracy, leading prompt-based methods to state-of-the-art performance on VTAB-1k and FGVC benchmarks. The code is available at https://github.com/ckshang/PRO-VPT.

[425] FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching

Zhen Zou, Feng Zhao

Main category: cs.CV

TL;DR: FEB-Cache addresses exposure bias in Diffusion Transformers by separating Attention and MLP caching based on frequency response characteristics, improving generation quality while maintaining acceleration.

DetailsMotivation: Current feature caching methods for Diffusion Transformers damage generation quality by amplifying exposure bias, but existing approaches don't analyze why caching causes this degradation.

Method: Proposed FEB-Cache, a joint caching strategy that separates Attention and MLP caching based on frequency-guided cache table to better align with non-exposed bias diffusion process.

Result: Empirical results show FEB-Cache optimizes model performance while facilitating acceleration, reducing exposure bias in cached diffusion processes.

Conclusion: FEB-Cache provides a new perspective on leveraging caching to accelerate diffusion processes by understanding and addressing the frequency response mismatch in Attention and MLP components.

Abstract: Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this issue, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning non-cache diffusion without analyzing why caching damage the generation processes. In this paper, we first confirm that the cache greatly amplifies the exposure bias, resulting in a decline in the generation quality. However, directly applying noise scaling is challenging for this issue due to the non-smoothness of exposure bias. We found that this phenomenon stems from the mismatch between its frequency response characteristics and the simple cache of Attention and MLP. Since these two components exhibit unique preferences for frequency signals, which provides us with a caching strategy to separate Attention and MLP to achieve an enhanced fit of exposure bias and reduce it. Based on this, we introduced FEB-Cache, a joint caching strategy that aligns with the non-exposed bias diffusion process (which gives us a higher performance cap) of caching Attention and MLP based on the frequency-guided cache table. Our approach combines a comprehensive understanding of the caching mechanism and offers a new perspective on leveraging caching to accelerate the diffusion process. Empirical results indicate that FEB-Cache optimizes model performance while concurrently facilitating acceleration. Code is available at https://github.com/aSleepyTree/EB-Cache.

[426] TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models

Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu

Main category: cs.CV

TL;DR: TRCE is a two-stage concept erasure method that effectively removes malicious concepts from text-to-image diffusion models while preserving normal generation capabilities, addressing limitations in handling implicitly embedded malicious prompts.

DetailsMotivation: Current concept erasure methods struggle to fully erase malicious concepts that are implicitly embedded in prompts (metaphorical expressions or adversarial prompts) while maintaining the model's normal generation capability, creating a need for better trade-off between reliable erasure and knowledge preservation.

Method: Two-stage approach: 1) Erase malicious semantics in textual prompts by optimizing cross-attention layers to map malicious prompts to contextually similar safe concepts using [EoT] embedding; 2) Use contrastive learning to steer early denoising predictions toward safe directions and away from unsafe ones, leveraging the deterministic properties of diffusion model sampling trajectories.

Result: Comprehensive evaluations on multiple malicious concept erasure benchmarks demonstrate TRCE’s effectiveness in erasing malicious concepts while better preserving the model’s original generation ability compared to existing methods.

Conclusion: TRCE provides an effective solution for achieving reliable concept erasure in text-to-image diffusion models, successfully balancing the removal of malicious content with the preservation of normal generation capabilities through its two-stage strategy.

Abstract: Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model’s normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective(i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model’s original generation ability. The code is available at: http://github.com/ddgoodgood/TRCE. CAUTION: This paper includes model-generated content that may contain offensive material.

[427] Exploring Representation Invariance in Finetuning

Wenqiang Zu, Shenghao Xie, Hao Chen, Zhiqiang Chen, Liwen Hu, Yuanhao Xi, Yiming Liang, Junliang Ye, Bo Lei, Tiejun Huang, Guoqi Li, Lei Ma

Main category: cs.CV

TL;DR: RIFT is a regularization method that preserves pretrained representations during finetuning by maximizing similarity between pretrained and finetuned models using orthogonal invariance.

DetailsMotivation: Pretrained foundation models lose their generalizable representations during finetuning, degrading their original capabilities when adapted to downstream tasks.

Method: Representation Invariance FineTuning (RIFT) - a regularization that leverages orthogonal invariance of manifolds to maximize representation similarity between pretrained and finetuned models in a computationally efficient way.

Result: RIFT is compatible with mainstream finetuning methods, offering competitive or enhanced performance while better preserving model generalizability.

Conclusion: Downstream tasks can be effectively adapted without sacrificing the benefits of pretrained representations through RIFT regularization.

Abstract: Foundation models pretrained on large-scale natural images are widely adapted to various cross-domain low-resource downstream tasks, benefiting from generalizable and transferable patterns captured by their representations. However, these representations are later found to gradually vanish during finetuning, accompanied by a degradation of model’s original generalizability. In this paper, we argue that such tasks can be effectively adapted without sacrificing the benefits of pretrained representations. We approach this by introducing \textit{Representation Invariance FineTuning (RIFT)}, a regularization that maximizes the representation similarity between pretrained and finetuned models by leveraging orthogonal invariance of manifolds in a computationally efficient way. Experiments demonstrate that our method is compatible with mainstream finetuning methods, offering competitive or even enhanced performance and better preservation of the generalizability.

[428] Explaining Human Preferences via Metrics for Structured 3D Reconstruction

Jack Langerman, Denys Rozumnyi, Yuzhong Huang, Dmytro Mishkin

Main category: cs.CV

TL;DR: This paper presents a comprehensive analysis of automated metrics for evaluating structured 3D reconstructions, including pitfalls analysis, expert preference studies, systematic unit tests, and a novel learned metric based on human judgments.

DetailsMotivation: The driving force is that 'what cannot be measured cannot be improved' - highlighting the need for reliable metrics to evaluate and improve structured 3D reconstruction methods.

Method: The paper analyzes existing metrics, discusses their pitfalls, conducts expert preference studies, proposes systematic unit tests to verify desirable properties, provides context-aware metric recommendations, and develops a learned metric distilled from human expert judgments.

Result: The research provides a detailed framework for evaluating 3D reconstruction metrics, identifies limitations of current approaches, and introduces a new learned metric that better aligns with human expert preferences.

Conclusion: The work establishes comprehensive guidelines for metric selection in 3D reconstruction evaluation and proposes a human-aligned learned metric as a superior alternative to traditional approaches, with open-source implementation available.

Abstract: “What cannot be measured cannot be improved” while likely never uttered by Lord Kelvin, summarizes effectively the driving force behind this work. This paper presents a detailed discussion of automated metrics for evaluating structured 3D reconstructions. Pitfalls of each metric are discussed, and an analysis through the lens of expert 3D modelers’ preferences is presented. A set of systematic “unit tests” are proposed to empirically verify desirable properties, and context aware recommendations regarding which metric to use depending on application are provided. Finally, a learned metric distilled from human expert judgments is proposed and analyzed. The source code is available at https://github.com/s23dr/wireframe-metrics-iccv2025

[429] Motion Blender Gaussian Splatting for Dynamic Scene Reconstruction

Xinyu Zhang, Haonan Chang, Yuhan Liu, Abdeslam Boularias

Main category: cs.CV

TL;DR: MBGS introduces motion graphs as explicit motion representation for Gaussian splatting, enabling controllable dynamic scene reconstruction and manipulation for robotics applications.

DetailsMotivation: Existing Gaussian splatting methods use implicit motion representations that limit motion manipulation and controllability, restricting applications in robotics.

Method: Uses motion graphs with dual quaternion skinning and learnable weight painting functions to propagate motion to individual Gaussians, jointly optimized via differentiable rendering.

Result: Achieves state-of-the-art performance on iPhone dataset and competitive results on HyperNeRF, enabling novel pose animation, robot demonstration synthesis, and visual planning.

Conclusion: MBGS provides explicit motion control for Gaussian splatting, expanding applications in robotics through motion graph representation and differentiable optimization.

Abstract: Gaussian splatting has emerged as a powerful tool for high-fidelity reconstruction of dynamic scenes. However, existing methods primarily rely on implicit motion representations, such as encoding motions into neural networks or per-Gaussian parameters, which makes it difficult to further manipulate the reconstructed motions. This lack of explicit controllability limits existing methods to replaying recorded motions only, which hinders a wider application in robotics. To address this, we propose Motion Blender Gaussian Splatting (MBGS), a novel framework that uses motion graphs as an explicit and sparse motion representation. The motion of a graph’s links is propagated to individual Gaussians via dual quaternion skinning, with learnable weight painting functions that determine the influence of each link. The motion graphs and 3D Gaussians are jointly optimized from input videos via differentiable rendering. Experiments show that MBGS achieves state-of-the-art performance on the highly challenging iPhone dataset while being competitive on HyperNeRF. We demonstrate the application potential of our method in animating novel object poses, synthesizing real robot demonstrations, and predicting robot actions through visual planning. The source code, models, video demonstrations can be found at http://mlzxy.github.io/motion-blender-gs.

[430] DecompDreamer: A Composition-Aware Curriculum for Structured 3D Asset Generation

Utkarsh Nath, Rajeev Goel, Rahul Khurana, Kyle Min, Mark Ollila, Pavan Turaga, Varun Jampani, Tejaswi Gowda

Main category: cs.CV

TL;DR: DecompDreamer introduces staged optimization to solve compositional text-to-3D generation by prioritizing inter-object relationships first, then refining individual components, avoiding gradient conflicts.

DetailsMotivation: Current text-to-3D methods fail on compositional prompts due to optimization conflicts from simultaneous or iterative heuristics, leading to entangled geometry or catastrophic divergence.

Method: A staged optimization strategy that functions as an implicit curriculum - first establishing structural scaffolds by prioritizing inter-object relationships, then shifting to high-fidelity refinement of individual components.

Result: Outperforms state-of-the-art methods in fidelity, disentanglement, and spatial coherence on diverse compositional prompts in both qualitative and quantitative evaluations.

Conclusion: Temporal decoupling of competing objectives through staged optimization provides a robust solution to gradient conflict in compositional text-to-3D generation.

Abstract: Current text-to-3D methods excel at generating single objects but falter on compositional prompts. We argue this failure is fundamental to their optimization schedules, as simultaneous or iterative heuristics predictably collapse under a combinatorial explosion of conflicting gradients, leading to entangled geometry or catastrophic divergence. In this paper, we reframe the core challenge of compositional generation as one of optimization scheduling. We introduce DecompDreamer, a framework built on a novel staged optimization strategy that functions as an implicit curriculum. Our method first establishes a coherent structural scaffold by prioritizing inter-object relationships before shifting to the high-fidelity refinement of individual components. This temporal decoupling of competing objectives provides a robust solution to gradient conflict. Qualitative and quantitative evaluations on diverse compositional prompts demonstrate that DecompDreamer outperforms state-of-the-art methods in fidelity, disentanglement, and spatial coherence.

[431] LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps

Yihao Wang, Raphael Memmesheimer, Sven Behnke

Main category: cs.CV

TL;DR: LIAM is an end-to-end model that predicts action transcripts for domestic service robots using language, image, action, and map inputs, with CLIP-based encoding and pre-training for modality alignment.

DetailsMotivation: To address the large variability of domestic tasks without implementing each task individually by providing robots with task descriptions and environment information using large language models and open-vocabulary object perception.

Method: Proposes LIAM model with CLIP backbone for encoding language and image inputs, designed two pre-training tasks to fine-tune weights and pre-align latent spaces, incorporates semantic maps.

Result: Evaluated on ALFRED dataset, demonstrates importance of pre-aligning embedding spaces from different modalities and efficacy of incorporating semantic maps.

Conclusion: The approach enables flexible domestic task execution through multimodal inputs and proper modality alignment, showing promising results for service robotics.

Abstract: The availability of large language models and open-vocabulary object perception methods enables more flexibility for domestic service robots. The large variability of domestic tasks can be addressed without implementing each task individually by providing the robot with a task description along with appropriate environment information. In this work, we propose LIAM - an end-to-end model that predicts action transcripts based on language, image, action, and map inputs. Language and image inputs are encoded with a CLIP backbone, for which we designed two pre-training tasks to fine-tune its weights and pre-align the latent spaces. We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks. Our results demonstrate the importance of pre-aligning embedding spaces from different modalities and the efficacy of incorporating semantic maps.

[432] AutoDrive-QA: A Multiple-Choice Benchmark for Vision-Language Evaluation in Urban Autonomous Driving

Boshra Khalili, Andrew W. Smyth

Main category: cs.CV

TL;DR: AutoDrive-QA is a new benchmark that converts open-ended driving QA datasets into structured multiple-choice questions with realistic distractors to enable standardized evaluation of vision-language models in urban driving contexts.

DetailsMotivation: Existing benchmarks for evaluating vision-language models in urban driving rely on ambiguous open-ended responses that are difficult to score consistently, slowing progress toward safe and reliable AI for autonomous driving.

Method: Systematically converts three driving QA datasets (DriveLM, NuScenes-QA, LingoQA) into multiple-choice questions with distractors grounded in five realistic error categories: Driving Domain Misconceptions, Logical Inconsistencies, Misinterpreted Sensor Inputs, Computational Oversights, and Question Ambiguity.

Result: Fine-tuned LLaVA-1.5-7B improved accuracy by ~6 percentage points across tasks, GPT-4V achieved strongest zero-shot performance (up to 69.8% accuracy), and Qwen2-VL performed competitively, especially in multi-view settings. Traditional metrics like BLEU and CIDEr failed to distinguish model performance.

Conclusion: AutoDrive-QA provides an objective, domain-grounded evaluation protocol for more transparent benchmarking of urban AI systems, supporting development of safer autonomous driving technologies.

Abstract: Evaluating vision-language models (VLMs) in urban driving contexts remains challenging, as existing benchmarks rely on open-ended responses that are ambiguous, annotation-intensive, and inconsistent to score. This lack of standardized evaluation slows progress toward safe and reliable AI for urban mobility. We introduce AutoDrive-QA, the first benchmark that systematically converts open-ended driving QA datasets (DriveLM, NuScenes-QA, LingoQA) into structured multiple-choice questions (MCQs) with distractors grounded in five realistic error categories: Driving Domain Misconceptions, Logical Inconsistencies, Misinterpreted Sensor Inputs, Computational Oversights, and Question Ambiguity. This framework enables reproducible and interpretable evaluation of VLMs across perception, prediction, and planning tasks in complex urban scenes. Experiments show that fine-tuning LLaVA-1.5-7B improves accuracy by about six percentage points across tasks, GPT-4V achieves the strongest zero-shot performance with up to 69.8% accuracy, and Qwen2-VL models also perform competitively, particularly in multi-view settings. Moreover, traditional metrics such as BLEU and CIDEr fail to distinguish strong from weak models. By providing an objective, domain-grounded evaluation protocol, AutoDrive-QA contributes to more transparent benchmarking of urban AI systems, supporting the development of safer and more trustworthy autonomous driving technologies for smart cities.

[433] Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation

Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, Yao Zhao

Main category: cs.CV

TL;DR: Jasmine is the first Stable Diffusion-based self-supervised framework for monocular depth estimation that leverages SD’s visual priors to improve prediction sharpness and generalization without requiring supervision.

DetailsMotivation: Previous SD-based methods require high-precision supervision for dense prediction, while self-supervised methods suffer from blurs and artifacts that compromise SD's latent priors due to challenges like occlusions and texture-less regions.

Method: Proposes a hybrid image reconstruction surrogate task that preserves SD’s detail priors without additional supervision, and introduces Scale-Shift GRU to bridge distribution gaps between SD’s scale-shift invariant estimation and self-supervised scale-invariant depth estimation.

Result: Jasmine achieves state-of-the-art performance on the KITTI benchmark and demonstrates superior zero-shot generalization across multiple datasets.

Conclusion: The framework successfully harnesses Stable Diffusion’s visual priors for self-supervised depth estimation while overcoming inherent challenges through hybrid reconstruction and scale-shift alignment techniques.

Abstract: In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD’s visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD’s latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD’s scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.

[434] MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

Zhenyu Pan, Han Liu

Main category: cs.CV

TL;DR: MetaSpatial is the first RL-based framework that enhances 3D spatial reasoning in VLMs for real-time 3D scene generation without hard-coded optimizations.

DetailsMotivation: Addresses the lack of internalized 3D spatial reasoning in VLMs and inefficiency of traditional supervised fine-tuning for layout generation tasks where perfect ground truth is unavailable.

Method: Uses multi-turn RL-based optimization with physics-aware constraints and rendered image evaluations, featuring adaptive iterative reasoning where VLMs refine spatial arrangements over multiple turns by analyzing rendered outputs.

Result: Significantly enhances spatial consistency and formatting stability of various scale models, with more realistic, aligned, and functionally coherent object placements.

Conclusion: Validates RL’s effectiveness for 3D spatial reasoning in metaverse, AR/VR, digital twins, and game development applications, with publicly available code, data, and training pipeline.

Abstract: We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without the need for hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks, as perfect ground truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints and rendered image evaluations, ensuring generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process, where the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, improving scene coherence progressively. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of various scale models. Post-training, object placements are more realistic, aligned, and functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital twins, and game development applications. Our code, data, and training pipeline are publicly available at https://github.com/PzySeere/MetaSpatial.

[435] Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

Main category: cs.CV

TL;DR: Reason-RFT is a two-stage reinforcement fine-tuning framework that enhances visual reasoning in VLMs through SFT with CoT data followed by GRPO-based reinforcement learning, achieving SOTA performance with strong generalization and data efficiency.

DetailsMotivation: Existing CoT supervised fine-tuning methods for VLMs cause overfitting and cognitive rigidity, limiting generalization under domain shifts and reducing real-world applicability.

Method: Two-stage framework: 1) SFT with curated CoT data to activate reasoning potential, 2) reinforcement learning using Group Relative Policy Optimization (GRPO) to generate multiple reasoning-response pairs for domain shift adaptability.

Result: Achieves SOTA results, outperforms open-source and proprietary models, maintains robust performance under domain shifts, and excels in few-shot learning scenarios surpassing full-dataset SFT baselines.

Conclusion: Reason-RFT introduces a novel training paradigm for visual reasoning and represents significant progress in multimodal research, addressing limitations of existing CoT fine-tuning approaches.

Abstract: Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model’s generalization ability under domain shifts and reducing real-world applicability. To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs. This is followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs to enhance adaptability to domain shifts. To evaluate Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across three key dimensions. Experimental results highlight three advantages: (1) performance enhancement, with Reason-RFT achieving state-of-the-art results and outperforming both open-source and proprietary models; (2) generalization superiority, maintaining robust performance under domain shifts across various tasks; and (3) data efficiency, excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines. Reason-RFT introduces a novel training paradigm for visual reasoning and marks a significant step forward in multimodal research. Project website: https://tanhuajie.github.io/ReasonRFT

[436] FVQ: A Large-Scale Dataset and an LMM-based Method for Face Video Quality Assessment

Sijing Wu, Yunhao Li, Ziwen Xu, Yixuan Gao, Huiyu Duan, Wei Sun, Guangtao Zhai

Main category: cs.CV

TL;DR: This paper introduces FVQ-20K, the first large-scale in-the-wild face video quality assessment dataset with 20,000 videos and MOS annotations, and proposes FVQ-Rater, a specialized method using large multimodal models for FVQA.

DetailsMotivation: Face video quality assessment is important because face videos are primary content on social media and human visual system is particularly sensitive to human faces, but FVQA is rarely explored due to lack of large-scale datasets.

Method: Proposed FVQ-Rater method extracts multi-dimensional features (spatial, temporal, portrait features, and face embeddings) and uses LoRA-based instruction tuning for quality-specific fine-tuning of large multimodal models.

Result: The method shows superior performance on both FVQ-20K and CFVQA datasets, demonstrating significant potential for FVQA development.

Conclusion: The FVQ-20K dataset and FVQ-Rater method significantly advance face video quality assessment research and show promising results for future development in this domain.

Abstract: Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face video, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we elaborately extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.

[437] From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation

Jingkun Chen, Haoran Duan, Xiao Zhang, Boyan Gao, Vicente Grau, Jungong Han

Main category: cs.CV

TL;DR: A teacher-student framework that integrates clinician gaze data and vision-language models for medical image segmentation, achieving 3-5% improvement over gaze-only baselines without additional annotation costs.

DetailsMotivation: Medical image segmentation suffers from high annotation costs. Gaze data provides diagnostic focus areas but is sparse, while vision-language models offer semantic context but lack precision. Neither source alone suffices for effective segmentation.

Method: Teacher-student framework where teacher learns from gaze points enhanced by VLM-generated lesion descriptions. Uses multi-scale feature alignment, confidence-weighted consistency constraints, and adaptive masking to guide student model.

Result: Achieved Dice scores of 80.78% (Kvasir-SEG), 80.53% (NCI-ISBI), and 84.22% (ISIC) - improving 3-5% over gaze baselines while maintaining clinical interpretability.

Conclusion: Integrating human visual attention with AI-generated semantic context effectively overcomes limitations of individual weak supervision signals, advancing annotation-efficient medical AI systems.

Abstract: Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively-improving 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: https://github.com/jingkunchen/FGI.

[438] MambaMoE: Mixture-of-Spectral-Spatial-Experts State Space Model for Hyperspectral Image Classification

Yichu Xu, Di Wang, Hongzan Jiao, Lefei Zhang, Liangpei Zhang

Main category: cs.CV

TL;DR: MambaMoE is a novel spectral-spatial Mixture-of-Experts framework for hyperspectral image classification that addresses directional modeling heterogeneity through adaptive expert activation and uncertainty-guided corrective learning, achieving state-of-the-art performance.

DetailsMotivation: Existing Mamba-based approaches overlook directional modeling heterogeneity across different land-cover types, leading to limited classification performance in hyperspectral image classification.

Method: Proposes MambaMoE with Mixture of Mamba Expert Block (MoMEB) for adaptive spectral-spatial feature modeling via sparse expert activation, and uncertainty-guided corrective learning (UGCL) strategy that samples supervision from high-uncertainty regions.

Result: Extensive experiments on multiple public HSI benchmark datasets show MambaMoE achieves state-of-the-art performance in both classification accuracy and computational efficiency compared to existing advanced methods.

Conclusion: MambaMoE represents the first MoE-based approach in HSI classification domain and demonstrates superior performance through adaptive expert modeling and uncertainty-guided learning.

Abstract: Mamba-based models have recently demonstrated significant potential in hyperspectral image (HSI) classification, primarily due to their ability to perform contextual modeling with linear computational complexity. However, existing Mamba-based approaches often overlook the directional modeling heterogeneity across different land-cover types, leading to limited classification performance. To address these limitations, we propose MambaMoE, a novel spectral-spatial Mixture-of-Experts (MoE) framework, which represents the first MoE-based approach in the HSI classification domain. Specifically, we design a Mixture of Mamba Expert Block (MoMEB) that performs adaptive spectral-spatial feature modeling via a sparse expert activation mechanism. Additionally, we introduce an uncertainty-guided corrective learning (UGCL) strategy that encourages the model to focus on complex regions prone to prediction ambiguity. This strategy dynamically samples supervision signals from regions with high predictive uncertainty, guiding the model to adaptively refine feature representations and thereby enhancing its focus on challenging areas. Extensive experiments conducted on multiple public HSI benchmark datasets show that MambaMoE achieves state-of-the-art performance in both classification accuracy and computational efficiency compared to existing advanced methods, particularly Mamba-based ones. The code will be available online at https://github.com/YichuXu/MambaMoE.

[439] FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

Chenxi Li, Weijie Wang, Qiang Li, Bruno Lepri, Nicu Sebe, Weizhi Nie

Main category: cs.CV

TL;DR: FreeInsert is a framework for text-driven 3D object insertion that uses foundation models to enable flexible object placement without requiring spatial priors like masks or bounding boxes.

DetailsMotivation: Existing methods rely on spatial priors and struggle with consistency, limiting flexibility and scalability for real-world 3D scene editing applications.

Method: Uses MLLM-based parser to extract structured semantics from text, leverages foundation models to disentangle object generation from spatial placement, and employs hierarchical refinement with spatial reasoning for pose initialization and visual enhancement.

Result: Achieves semantically coherent, spatially precise, and visually realistic 3D insertions without spatial priors, providing a user-friendly editing experience.

Conclusion: FreeInsert offers a flexible and scalable solution for text-driven 3D object insertion by leveraging foundation models to overcome limitations of existing spatial-prior-dependent methods.

Abstract: Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. However, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models including MLLMs, LGMs, and diffusion models to disentangle object generation from spatial placement. This enables unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert starts with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We leverage the spatial reasoning capabilities of MLLMs to initialize object pose and scale. A hierarchical, spatially aware refinement stage further integrates spatial semantics and MLLM-inferred priors to enhance placement. Finally, the appearance of the object is improved using the inserted-object image to enhance visual fidelity. Experimental results demonstrate that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors, offering a user-friendly and flexible editing experience.

[440] MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman

Main category: cs.CV

TL;DR: MMLongBench is the first comprehensive benchmark for evaluating long-context vision-language models across diverse tasks, image types, and standardized input lengths (8K-128K tokens).

DetailsMotivation: The rapid development of long-context vision-language models (LCVLMs) capable of handling hundreds of images with interleaved text requires effective evaluation methods, but existing benchmarks are insufficient for thorough assessment.

Method: Created MMLongBench with 13,331 examples spanning 5 task categories (Visual RAG, Many-Shot ICL, etc.), covering various natural and synthetic images, delivered at 5 standardized input lengths using cross-modal tokenization.

Result: Benchmarked 46 LCVLMs revealing that: single task performance poorly predicts overall capability; both closed-source and open-source models struggle with long-context tasks; models with stronger reasoning ability perform better.

Conclusion: MMLongBench provides the essential foundation for diagnosing and advancing next-generation LCVLMs through comprehensive task coverage, diverse image types, and rigorous length control.

Abstract: The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models’ vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.

[441] Enhancing Transformers Through Conditioned Embedded Tokens

Hemanth Saratchandran, Simon Lucey

Main category: cs.CV

TL;DR: Transformers suffer from inherent ill-conditioning in attention blocks that hampers training efficiency. The paper introduces conditioned embedded tokens to systematically improve attention conditioning, leading to more stable and efficient training across various tasks.

DetailsMotivation: The attention mechanism in transformers suffers from inherent ill-conditioning that negatively impacts gradient-based optimization and training efficiency, despite being fundamental to transformer success.

Method: Developed a theoretical framework linking attention block conditioning to embedded token conditioning, then introduced conditioned embedded tokens - a method that systematically modifies embedded tokens to improve attention mechanism conditioning.

Result: The approach significantly mitigates ill-conditioning, leading to more stable and efficient training. Validated across various transformer architectures with consistent improvements in image classification, object detection, instance segmentation, and NLP tasks.

Conclusion: Conditioned embedded tokens provide an effective solution to transformer attention ill-conditioning, demonstrating broad applicability and effectiveness across multiple domains and architectures.

Abstract: Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.

[442] RGB-to-Polarization Estimation: A New Task and Benchmark Study

Beibei Lin, Zifeng Yuan, Tingting Chen

Main category: cs.CV

TL;DR: This paper introduces RGB-to-polarization image estimation as a new task and establishes the first comprehensive benchmark for it, evaluating various deep learning models to infer polarization information from standard RGB images.

DetailsMotivation: Polarization images provide rich physical information missing from RGB images, but acquiring them requires additional optical components that increase cost and complexity. This work aims to bridge the gap by enabling polarization estimation from readily available RGB images.

Method: The authors leverage existing polarization datasets to create a benchmark and evaluate diverse state-of-the-art deep learning models, including both restoration-oriented and generative architectures, through extensive quantitative and qualitative analysis.

Result: The benchmark establishes the current performance ceiling for RGB-to-polarization estimation and systematically reveals the strengths and limitations of different model families (direct reconstruction vs generative synthesis, task-specific training vs large-scale pre-training).

Conclusion: This benchmark serves as a foundational resource to facilitate future research on polarization estimation from RGB inputs, with the authors providing potential directions for future work in this area.

Abstract: Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridge this gap, we introduce a new task: RGB-to-polarization image estimation, which aims to infer polarization information directly from RGB images. In this work, we establish the first comprehensive benchmark for this task by leveraging existing polarization datasets and evaluating a diverse set of state-of-the-art deep learning models, including both restoration-oriented and generative architectures. Through extensive quantitative and qualitative analysis, our benchmark not only establishes the current performance ceiling of RGB-to-polarization estimation, but also systematically reveals the respective strengths and limitations of different model families – such as direct reconstruction versus generative synthesis, and task-specific training versus large-scale pre-training. In addition, we provide some potential directions for future research on polarization estimation. This benchmark is intended to serve as a foundational resource to facilitate the design and evaluation of future methods for polarization estimation from standard RGB inputs.

[443] GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Pengyue Jia, Seongheon Park, Song Gao, Xiangyu Zhao, Yixuan Li

Main category: cs.CV

TL;DR: GeoRanker is a distance-aware ranking framework for worldwide image geolocalization that uses vision-language models to encode query-candidate interactions and predict geographic proximity, achieving state-of-the-art results.

DetailsMotivation: Existing two-stage geolocalization approaches rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates, which limits their performance.

Method: Proposes GeoRanker framework with large vision-language models to jointly encode query-candidate interactions, introduces multi-order distance loss for ranking both absolute and relative distances, and creates GeoRanking dataset for geographic ranking tasks.

Result: Achieves state-of-the-art results on IM2GPS3K and YFCC4K benchmarks, significantly outperforming current best methods.

Conclusion: GeoRanker effectively models spatial relationships in image geolocalization through distance-aware ranking and structured reasoning, demonstrating superior performance over existing approaches.

Abstract: Worldwide image geolocalization-the task of predicting GPS coordinates from images taken anywhere on Earth-poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.

[444] Constructing a 3D Scene from a Single Image

Kaizhi Zheng, Ruijian Zha, Zishuo Xu, Jing Gu, Jie Yang, Xin Eric Wang

Main category: cs.CV

TL;DR: SceneFuse-3D is a training-free framework that generates coherent 3D scenes from single top-down images using region-based generation and spatial-aware 3D inpainting to overcome resolution bottlenecks and maintain structural continuity.

DetailsMotivation: Traditional 3D scene acquisition requires expensive equipment, multi-view data, or manual modeling. While recent 3D generative models work well for objects, they struggle with full-scene generation, leading to inconsistent geometry, layout hallucinations, and low-quality meshes.

Method: Decomposes input image into overlapping regions, generates each using a pretrained 3D object generator, then applies masked rectified flow inpainting to fill missing geometry while maintaining structural continuity. Uses region-based generation for better alignment and spatial-aware 3D inpainting for coherence.

Result: Outperforms state-of-the-art baselines (Trellis, Hunyuan3D-2, TripoSG, LGM) in geometry quality, spatial coherence, and texture fidelity across diverse scenes. Achieves high-quality coherent 3D scene generation without 3D supervision or fine-tuning.

Conclusion: High-quality coherent 3D scene-level asset generation is achievable from single top-down images using a principled, training-free pipeline that overcomes resolution limitations and preserves spatial structure.

Abstract: Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce SceneFuse-3D, a training-free framework designed to synthesize coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each using a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that SceneFuse-3D outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, TripoSG, and LGM, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality coherent 3D scene-level asset generation is achievable from a single top-down image using a principled, training-free pipeline.

[445] Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

Main category: cs.CV

TL;DR: Circle-RoPE is a novel positional encoding method for vision-language models that eliminates cross-modal positional biases by projecting image tokens onto a ring orthogonal to text tokens, creating a cone-like structure where each text token maintains equal distance to all image tokens.

DetailsMotivation: RoPE and its variants create unintended cross-modal positional biases in VLMs by enforcing relative positional dependencies separately within text and image tokens, causing semantically consistent image tokens to have different positional encodings based on spatial location, leading to misaligned cross-modal representations.

Method: Propose Per-Token Distance metric to quantify positional encoding independence, then introduce Circle-RoPE which projects image token indices onto a ring orthogonal to the linear text token axis, forming a cone-like structure. Also use staggered strategy applying different RoPE variants across layers.

Result: Extensive experiments show the method effectively preserves spatial information from images while reducing relative positional bias, providing a more robust and flexible positional encoding framework for VLMs.

Conclusion: Circle-RoPE offers an effective solution to eliminate spurious cross-modal biases in VLMs while maintaining intra-image spatial information, creating a more balanced positional encoding framework for multimodal learning.

Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens, introducing unintended cross-modal positional biases. For example, image tokens depicting semantically consistent content are assigned distinct positional encodings solely due to spatial location variations. As a result, such tokens exhibit entirely different relative positional relationships with their corresponding text tokens, ultimately leading to misaligned cross-modal representations. To address this, we propose Per-Token Distance, a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases. Our key idea is to project image token indices onto a \emph{ring} that is orthogonal to the linear axis of text token indices, thereby forming a cone-like structure in the positional encoding space. In this configuration, each text token (point on the linear text axis) becomes the apex of a cone and maintains an equal distance to all image tokens (points on the circular image \emph{ring}), reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered strategy that applies different RoPE variants across layers. Extensive experiments demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for VLMs. The code is available at https://github.com/lose4578/CircleRoPE.

[446] ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection

Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li

Main category: cs.CV

TL;DR: ViP²-CLIP introduces a Visual-Perception Prompting mechanism that adaptively generates fine-grained textual prompts using global and multi-scale local visual context, eliminating the need for manual templates and class-name priors in zero-shot anomaly detection.

DetailsMotivation: Existing CLIP-based methods for zero-shot anomaly detection rely on handcrafted or static learnable prompts, which have limitations in semantic coverage and adaptability to complex anomaly variations. CLIP's sensitivity to exact class name wording also constrains prompting strategies.

Method: Proposes ViP²-CLIP with Visual-Perception Prompting (ViP-Prompt) mechanism that fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name dependencies.

Result: Extensive experiments on 15 industrial and medical benchmarks demonstrate state-of-the-art performance and robust cross-domain generalization.

Conclusion: ViP²-CLIP effectively addresses limitations of existing CLIP-based zero-shot anomaly detection methods by enabling adaptive prompt generation that focuses on precise abnormal regions, making it valuable for scenarios with ambiguous category labels or privacy constraints.

Abstract: Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model’s ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types, thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^{2}$-CLIP. The key insight of ViP$^{2}$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP$^{2}$-CLIP achieves state-of-the-art performance and robust cross-domain generalization.

[447] AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models

Xingjian Li, Qifeng Wu, Adithya S. Ubaradka, Yiran Ding, Colleen Que, Runmin Jiang, Jianhua Xing, Tianyang Wang, Min Xu

Main category: cs.CV

TL;DR: A zero-shot automatic medical image segmentation pipeline that combines vision-language and segmentation foundation models with test-time adaptation, achieving 69% improvement in accuracy without requiring expert annotations or prompts.

DetailsMotivation: Current deep learning methods for medical image segmentation require extensive expert effort through large annotated datasets or manual prompts for each case, which is inefficient and not scalable.

Method: Uses grounding model for initial bounding box, visual prompt boosting module to enhance prompts, and promptable segmentation model. Introduces test-time adaptation with learnable adaptors optimized via Bayesian Optimization guided by proxy validation model.

Result: Achieves 71.81 Dice Score (69% relative improvement from 42.53) across seven medical imaging datasets, performing competitively with weakly-prompted interactive foundation models.

Conclusion: The pipeline provides an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks by proper decomposition and test-time adaptation.

Abstract: Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, i.e., either through annotating large training datasets or providing prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., “segment the optic disc in an eye fundus image”), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhance the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical inputs with foundation model representations. Its hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model without requiring ground-truth labels. Our pipeline offers an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks. Our pipeline is evaluated on seven diverse medical imaging datasets and shows promising results. By proper decomposition and test-time adaptation, our fully automatic pipeline not only substantially surpasses the previously best-performing method, yielding a 69% relative improvement in accuracy (Dice Score from 42.53 to 71.81), but also performs competitively with weakly-prompted interactive foundation models.

[448] Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models

Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman

Main category: cs.CV

TL;DR: Selective synthetic data augmentation of only 30-40% of training data (specifically parts not learned early) outperforms full dataset augmentation, improving generalization by up to 2.8% across various models and datasets.

DetailsMotivation: Existing synthetic data augmentation methods struggle with diversity and require 10-30x data size increases for in-distribution performance improvements, motivating a more targeted approach.

Method: Augment only the portion of data that is not learned early in training with faithful images containing same features but different noise, rather than augmenting the entire dataset.

Result: Method boosts generalization by up to 2.8% across ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100 and TinyImageNet, with various optimizers. Applied with SGD, it outperforms SOTA optimizer SAM on CIFAR-100 and TinyImageNet.

Conclusion: Selective augmentation of poorly learned data promotes homogeneity in feature learning speed without amplifying noise, providing more effective generalization than full dataset augmentation.

Abstract: Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting part of the data that is not learned early in training with faithful images-containing same features but different noise-outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts generalization by up to 2.8% in a variety of scenarios, including training ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100, and TinyImageNet, with various optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet.

[449] CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis

Runmin Jiang, Genpei Zhang, Yuntian Yang, Siqi Wu, Minhao Wu, Wanyue Feng, Yizhou Zhao, Xi Xiao, Xiao Wang, Tianyang Wang, Xingjian Li, Muyuan Chen, Min Xu

Main category: cs.CV

TL;DR: CryoCCD is a synthetic data generation framework for cryo-EM that combines biophysical modeling with conditional diffusion models to create realistic micrographs, addressing the scarcity of annotated datasets.

DetailsMotivation: The development of cryo-EM processing tools is limited by the lack of high-quality annotated datasets, and existing synthetic data approaches fail to properly model biological heterogeneity and realistic noise.

Method: CryoCCD unifies versatile biophysical modeling with a conditional cycle-consistent diffusion model, enhanced with mask-guided contrastive learning to ensure realistic noise while preserving structural fidelity.

Result: Extensive experiments show CryoCCD generates structurally faithful micrographs, improves particle picking and pose estimation, and outperforms state-of-the-art baselines while generalizing to held-out protein families.

Conclusion: CryoCCD provides an effective solution for synthetic cryo-EM data generation that captures authentic biological organization and realistic noise, enabling better development of cryo-EM processing tools.

Abstract: Single-particle cryo-electron microscopy (cryo-EM) has become a cornerstone of structural biology, enabling near-atomic resolution analysis of macromolecules through advanced computational methods. However, the development of cryo-EM processing tools is constrained by the scarcity of high-quality annotated datasets. Synthetic data generation offers a promising alternative, but existing approaches lack thorough biophysical modeling of heterogeneity and fail to reproduce the complex noise observed in real imaging. To address these limitations, we present CryoCCD, a synthesis framework that unifies versatile biophysical modeling with the first conditional cycle-consistent diffusion model tailored for cryo-EM. The biophysical engine provides multi-functional generation capabilities to capture authentic biological organization, and the diffusion model is enhanced with cycle consistency and mask-guided contrastive learning to ensure realistic noise while preserving structural fidelity. Extensive experiments demonstrate that CryoCCD generates structurally faithful micrographs, enhances particle picking and pose estimation, as well as achieves superior performance over state-of-the-art baselines, while also generalizing effectively to held-out protein families.

[450] DS-VTON: An Enhanced Dual-Scale Coarse-to-Fine Framework for Virtual Try-On

Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

Main category: cs.CV

TL;DR: DS-VTON is a dual-scale coarse-to-fine framework for virtual try-on that addresses structural alignment and texture preservation through a two-stage approach with blend-refine diffusion, achieving state-of-the-art performance without requiring segmentation masks.

DetailsMotivation: Existing virtual try-on methods struggle to simultaneously achieve accurate garment-body alignment and preserve fine-grained garment textures and patterns, which are both crucial for realistic virtual try-on results.

Method: A dual-scale coarse-to-fine framework with two stages: first generates low-resolution try-on results for robust structural alignment, then uses a blend-refine diffusion process to reconstruct high-resolution outputs by refining scale residuals through noise-image blending.

Result: DS-VTON achieves state-of-the-art performance and significantly surpasses prior methods in both structural alignment and texture fidelity across multiple standard virtual try-on benchmarks.

Conclusion: The proposed DS-VTON framework effectively addresses the core challenges of virtual try-on through its dual-scale approach and mask-free generation strategy, demonstrating superior performance in both structural alignment and texture preservation.

Abstract: Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. These two requirements map directly onto a coarse-to-fine generation paradigm, where the coarse stage handles structural alignment and the fine stage recovers rich garment details. Motivated by this observation, we propose DS-VTON, an enhanced dual-scale coarse-to-fine framework that tackles the try-on problem more effectively. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. In the second stage, a blend-refine diffusion process reconstructs high-resolution outputs by refining the residual between scales through noise-image blending, emphasizing texture fidelity and effectively correcting fine-detail errors from the low-resolution stage. In addition, our method adopts a fully mask-free generation strategy, eliminating reliance on human parsing maps or segmentation masks. Extensive experiments show that DS-VTON not only achieves state-of-the-art performance but consistently and significantly surpasses prior methods in both structural alignment and texture fidelity across multiple standard virtual try-on benchmarks.

[451] Resolving Task Objective Conflicts in Unified Model via Task-Aware Mixture-of-Experts

Jiaxing Zhang, Hao Tang

Main category: cs.CV

TL;DR: UTAMoE is a novel framework that uses Task-Aware Mixture-of-Experts to decouple internal autoregressive modules, resolving task objective conflicts between understanding and generation in multimodal LLMs through task-specific optimization subpaths and two-stage training.

DetailsMotivation: Address intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation within unified multimodal LLMs, which existing solutions fail to resolve due to inherent autoregressive architecture limitations.

Method: Propose UTAMoE framework that decouples internal AR modules via Task-Aware MoE Layer to create task-specific optimization subpaths, combined with a novel Two-Stage Training Strategy to enhance task differentiation while maintaining coordination.

Result: Extensive experiments show UTAMoE mitigates task objective conflicts and achieves state-of-the-art performance across various multimodal benchmarks, with visualizations and ablation studies validating effectiveness.

Conclusion: The proposed UTAMoE framework successfully resolves task objective conflicts in multimodal LLMs through internal AR module decoupling and task-aware optimization, demonstrating superior performance over existing approaches.

Abstract: Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts due to inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of AR to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.

[452] A Neurosymbolic Agent System for Compositional Visual Reasoning

Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu

Main category: cs.CV

TL;DR: VLAgent is a neuro-symbolic system that combines LLM planning with symbolic execution for compositional visual reasoning, featuring syntax-semantic parsing and execution verification to improve accuracy.

DetailsMotivation: Existing vision-language models struggle with compositional visual reasoning tasks that require complex multi-step reasoning about visual content.

Method: Two-stage neuro-symbolic system: front-end LLM generates structured reasoning plans, back-end executes them using neural models and symbolic functions, with SS-parser for error detection/repair and execution verifier for stepwise validation.

Result: Outperforms a dozen state-of-the-art visual reasoning models on six visual benchmarks.

Conclusion: VLAgent demonstrates that neuro-symbolic approaches with structured planning and verification mechanisms can significantly improve compositional visual reasoning capabilities.

Abstract: The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal vision-language reasoning capabilities. However, existing vision-language models (VLMs) remain challenged by compositional visual reasoning. This paper presents VLAgent, a neuro-symbolic approach to developing a Vision-Language Agent system for efficient compositional visual reasoning with three novel features. First, VLAgent develops an interpretable visualization-enhanced two-stage neuro-symbolic reasoning system. The first stage is managed by a front-end engine that generates a structured visual reasoning plan (symbolic program script) for each compositional visual reasoning task by utilizing a pre-trained LLM powered with few-shot chain-of-thought in-context learning. The second stage is managed by a high-performance back-end engine. It transforms the planning script into executable code based on visual input (image or video) and the combination of neural models and symbolic functions and then performs a sequence of actions for the compositional visual reason task. Second, to ensure and enhance the quality of mapping the logic plan to a sequence of executable instructions, VLAgent introduces the SS-parser, which examines the syntax and semantic correctness of the planning script, detects and repairs the logic errors found in the LLM-generated logic plan before generating the executable program. Third, VLAgent introduces the execution verifier in critical reasoning steps to validate and refine its compositional reasoning results in a stepwise manner, for example, ensemble methods for critical visual reasoning and caption analysis for low-confidence compositional reasoning. Extensive experiments on six visual benchmarks compared to a dozen SoTA visual reasoning models show that VLAgent outperforms existing representative approaches to compositional visual reasoning.

[453] RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Dong Chen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok

Main category: cs.CV

TL;DR: RoboSwap is a video editing framework that swaps robotic arms in videos using unpaired data from diverse environments, combining GANs for arm translation and diffusion models for refinement to enhance cross-embodiment learning.

DetailsMotivation: Address the scarcity of diverse, high-quality datasets for video-conditioned robotic learning and enable cross-platform generalization by allowing robotic arm swapping without requiring paired video demonstrations in the same environmental settings.

Method: Proposes a novel video editing pipeline that integrates GANs and diffusion models. First segments robotic arms from backgrounds, trains an unpaired GAN to translate one robotic arm to another, blends the translated arm with original background, and refines with diffusion model for coherence and motion realism.

Result: Outperforms state-of-the-art video and image editing models on three benchmarks in terms of structural coherence and motion consistency, providing a robust solution for generating reliable cross-embodiment data.

Conclusion: RoboSwap offers an effective framework for cross-embodiment learning by leveraging unpaired data and combining the advantages of both GANs and diffusion models, reducing data collection needs while maintaining high quality video synthesis.

Abstract: Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for crossembodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their isolated advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.

[454] WetCat: Enabling Automated Skill Assessment in Wet-Lab Cataract Surgery Videos

Negin Ghamsarian, Raphael Sznitman, Klaus Schoeffmann, Jens Kowal

Main category: cs.CV

TL;DR: WetCat is the first wet-lab cataract surgery video dataset for automated skill assessment, featuring phase annotations and semantic segmentations to enable AI-driven evaluation tools aligned with clinical metrics.

DetailsMotivation: Traditional wet-lab surgical training relies on manual evaluations that are labor-intensive, time-consuming, and subjective. There's a need for automated, objective skill assessment in controlled training environments.

Method: Created WetCat dataset with high-resolution recordings of cataract surgeries on artificial eyes, including comprehensive phase annotations and semantic segmentations of key anatomical structures during capsulorhexis and phacoemulsification phases.

Result: Developed a publicly available dataset that enables interpretable AI-driven evaluation tools for surgical skill assessment, supporting standardized clinical metrics and workflow analysis.

Conclusion: WetCat establishes a foundation for objective, scalable surgical education and sets a new benchmark for automated skill assessment in ophthalmology training.

Abstract: To meet the growing demand for systematic surgical training, wet-lab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wet-lab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wet-lab settings. To address these limitations, we introduce WetCat, the first dataset of wet-lab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available in Synapse.

[455] Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency

Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias

Main category: cs.CV

TL;DR: Efficient Probing (EP) is proposed as a multi-query cross-attention mechanism that outperforms linear probing and prior attentive probing methods across multiple benchmarks while being more parameter-efficient.

DetailsMotivation: Standard linear probing fails to adequately evaluate models that optimize patch-level representations rather than global representations, and existing attentive probing methods suffer from excessive parameterization and poor computational efficiency.

Method: Proposes Efficient Probing (EP) - a simple multi-query cross-attention mechanism that eliminates redundant projections and reduces trainable parameters while selectively aggregating patch-level features.

Result: EP outperforms linear probing and prior attentive probing approaches across seven benchmarks, generalizes well to diverse pre-training paradigms, and delivers strong low-shot and layer-wise gains.

Conclusion: Efficient Probing provides an effective alternative to linear probing, uncovering emerging properties like complementary attention maps that open new directions for leveraging probing beyond protocol design.

Abstract: As fine-tuning becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol. Yet, the standard linear probing fails to adequately reflect the potential of models whose pre-training optimizes representations of patch tokens rather than an explicit global representation. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy vs. parameter efficiency trade-off. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance. Building on this, we propose efficient probing (EP), a simple yet effective multi-query cross-attention mechanism that eliminates redundant projections and reduces the number of trainable parameters. Despite its simplicity, EP outperforms linear probing and prior attentive probing approaches across seven benchmarks, generalizes well to diverse pre-training paradigms, and delivers strong low-shot and layer-wise gains. Beyond evaluation, our analysis uncovers emerging properties of EP, such as complementary attention maps, which open new directions for leveraging probing beyond protocol design. Code available at https://github.com/billpsomas/efficient-probing.

[456] SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks

Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin

Main category: cs.CV

TL;DR: SIRI-Bench is a new benchmark with 9,000 video-question-answer triplets that evaluates Vision-Language Models’ structural spatial intelligence through spatial-grounded reasoning tasks in realistic 3D scenes.

DetailsMotivation: While LLMs have advanced through reinforcement learning on reasoning tasks, spatial intelligence in VLMs remains underexplored despite being fundamental for real-world interaction.

Method: Developed an Automatic Scene Creation Engine using collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes, creating 9,000 video-question-answer triplets embedded in realistic 3D environments.

Result: State-of-the-art VLMs struggle significantly on SIRI-Bench, highlighting the challenge of structural spatial reasoning and the gap in current VLM capabilities.

Conclusion: The study aims to bring attention to spatially grounded reasoning and advance VLMs in visual problem-solving by providing a comprehensive benchmark for evaluating structural spatial intelligence.

Abstract: Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs’ structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will bring researchers’ attention to spatially grounded reasoning and advance VLMs in visual problem-solving.

[457] VisualChef: Generating Visual Aids in Cooking via Mask Inpainting

Oleh Kuzyk, Zuoyue Li, Marc Pollefeys, Xi Wang

Main category: cs.CV

TL;DR: VisualChef generates contextual cooking visuals using mask-based grounding to show action execution and results while preserving environment consistency, outperforming existing methods.

DetailsMotivation: Cooking requires visual guidance but existing recipe images/videos lack consistency in focus, tools, and setup, making the process challenging without proper visual aids.

Method: Uses mask-based visual grounding to identify action-relevant objects and classify them for targeted modifications. Includes automated pipeline to extract high-quality initial, action, and final state frames from videos.

Result: Shows improvements over state-of-the-art methods in quantitative and qualitative evaluations on three egocentric video datasets.

Conclusion: VisualChef effectively generates contextual cooking visuals that depict both action execution and resulting appearance while maintaining environmental consistency through simplified alignment approach.

Abstract: Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action’s execution and the resulting appearance of the object, while preserving the initial frame’s environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.

[458] Light of Normals: Unified Feature Representation for Universal Photometric Stereo

Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao

Main category: cs.CV

TL;DR: LINO UniPS introduces light register tokens and interleaved attention blocks for better illumination-normal decoupling, plus wavelet-based architecture and normal-gradient loss for high-frequency detail preservation, achieving SOTA results on universal photometric stereo.

DetailsMotivation: Current universal photometric stereo methods cannot guarantee illumination and normal information decoupling, and tend to lose high-frequency geometric details.

Method: Light Register Tokens with alignment supervision, Interleaved Attention Blocks with global cross-image attention, Wavelet-based Dual-branch Architecture, and Normal-gradient Perception Loss. Also introduces PS-Verse dataset and curriculum training.

Result: Achieves state-of-the-art results on DiLiGenT and Luces benchmarks, shows stronger generalization to real materials, and improved efficiency. Ablations confirm effectiveness of proposed components.

Conclusion: LINO UniPS successfully addresses illumination-normal decoupling and detail preservation challenges in universal photometric stereo through novel architectural designs and training strategies.

Abstract: Universal photometric stereo (PS) is defined by two factors: it must (i) operate under arbitrary, unknown lighting conditions and (ii) avoid reliance on specific illumination models. Despite progress (e.g., SDM UniPS), two challenges remain. First, current encoders cannot guarantee that illumination and normal information are decoupled. To enforce decoupling, we introduce LINO UniPS with two key components: (i) Light Register Tokens with light alignment supervision to aggregate point, direction, and environment lights; (ii) Interleaved Attention Block featuring global cross-image attention that takes all lighting conditions together so the encoder can factor out lighting while retaining normal-related evidence. Second, high-frequency geometric details are easily lost. We address this with (i) a Wavelet-based Dual-branch Architecture and (ii) a Normal-gradient Perception Loss. These techniques yield a unified feature space in which lighting is explicitly represented by register tokens, while normal details are preserved via wavelet branch. We further introduce PS-Verse, a large-scale synthetic dataset graded by geometric complexity and lighting diversity, and adopt curriculum training from simple to complex scenes. Extensive experiments show new state-of-the-art results on public benchmarks (e.g., DiLiGenT, Luces), stronger generalization to real materials, and improved efficiency; ablations confirm that Light Register Tokens + Interleaved Attention Block drive better feature decoupling, while Wavelet-based Dual-branch Architecture + Normal-gradient Perception Loss recover finer details.

[459] ImplicitQA: Going beyond frames towards Implicit Video Reasoning

Sirnam Swetha, Rohit Gupta, Parth Parag Kulkarni, David G Shatwell, Jeffrey A Chan Santiago, Nyle Siddiqui, Joseph Fioresi, Mubarak Shah

Main category: cs.CV

TL;DR: ImplicitQA is a new VideoQA benchmark that tests models on implicit reasoning in creative videos, where answers require inference beyond explicit visual content. Current models perform poorly on this benchmark, revealing their limitations in human-like understanding.

DetailsMotivation: Current VideoQA benchmarks focus on explicit visual content, but creative videos require implicit reasoning about omitted information, motives, and relationships across discontinuous frames - a capability humans excel at but current models lack.

Method: Created ImplicitQA benchmark with 1K QA pairs from 1K creative video clips across 15 genres and 7 decades. Questions systematically cover 9 reasoning dimensions including spatial reasoning, motion, causal reasoning, social interactions, and more. Annotations were crafted by authors, validated by multiple annotators, and benchmarked against human performance.

Result: Evaluation of 11 leading VideoQA models showed consistent and significant performance degradation on ImplicitQA, demonstrating their reliance on surface-level visual cues and inability to perform complex implicit reasoning.

Conclusion: ImplicitQA exposes critical limitations in current VideoQA models’ reasoning capabilities and highlights the need for models that can perform human-like implicit reasoning across discontinuous visual contexts in creative content.

Abstract: Video Question Answering (VideoQA) has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content - actions, objects, and events directly observable within individual frames or short clips. In contrast, creative and cinematic videos - such as movies, TV shows, and narrative-driven content - employ storytelling techniques that deliberately omit certain depictions, requiring viewers to infer motives, relationships across discontinuous frames with disjoint visual contexts. Humans naturally excel at such implicit reasoning, seamlessly integrating information across time and context to construct coherent narratives. Yet current benchmarks fail to capture this essential dimension of human-like understanding. To bridge this gap, we present ImplicitQA, a novel benchmark specifically designed to test VideoQA models on human-like implicit reasoning. ImplicitQA comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips covering 15 genres across 7 decades of content. Questions are systematically categorized into nine key reasoning dimensions: lateral and vertical spatial reasoning, depth and proximity, viewpoint and visibility, motion and trajectory, causal and motivational reasoning, social interactions, physical context, and inferred counting. These annotations are deliberately challenging, crafted by authors, validated through multiple annotators, and benchmarked against human performance to ensure high quality. Our extensive evaluations on 11 leading VideoQA models reveals consistent and significant performance degradation, underscoring their reliance on surface-level visual cues and highlighting the difficulty of implicit reasoning. https://huggingface.co/datasets/ucf-crcv/ImplicitQA.

[460] VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Dinh Phu Tran, Dao Duy Hung, Daeyoung Kim

Main category: cs.CV

TL;DR: VSRM is a novel video super-resolution framework that leverages Mamba’s long-sequence modeling capabilities to overcome limitations of CNN and Transformer methods, achieving state-of-the-art results through spatio-temporal Mamba blocks, deformable alignment, and frequency domain optimization.

DetailsMotivation: Current CNN-based methods have limited receptive fields while Transformers suffer from quadratic complexity, making long-sequence processing challenging in video super-resolution. Mamba offers promising alternatives with linear complexity and large receptive fields.

Method: Proposes VSRM framework with Spatial-to-Temporal and Temporal-to-Spatial Mamba blocks for long-range spatio-temporal feature extraction. Introduces Deformable Cross-Mamba Alignment for dynamic frame alignment and Frequency Charbonnier-like loss to minimize frequency domain gaps.

Result: VSRM achieves state-of-the-art results on diverse benchmarks, demonstrating superior performance in video super-resolution tasks compared to existing methods.

Conclusion: VSRM establishes itself as a solid foundation for future video super-resolution research by effectively leveraging Mamba’s capabilities for efficient long-sequence modeling and high-quality reconstruction.

Abstract: Video super-resolution remains a major challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel \textbf{V}ideo \textbf{S}uper-\textbf{R}esolution framework that leverages the power of \textbf{M}amba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency domain gaps between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.

[461] Comprehensive Evaluation of Large Multimodal Models for Nutrition Analysis: A New Benchmark Enriched with Contextual Metadata

Bruce Coburn, Jiangpeng He, Megan E. Rollo, Satvinder S. Dhaliwal, Deborah A. Kerr, Fengqing Zhu

Main category: cs.CV

TL;DR: This paper investigates how contextual metadata (location, time, food items) enhances Large Multimodal Models’ performance in nutrition analysis, introduces ACETADA dataset, and shows metadata integration reduces prediction errors.

DetailsMotivation: Existing work primarily evaluates proprietary models like GPT-4, leaving other LLMs underexplored, and the influence of contextual metadata integration with reasoning modifiers remains largely uncharted.

Method: Evaluated eight LMMs (four open-weight, four closed-weight) using contextual metadata from GPS coordinates (location/venue type), timestamps (meal/day type), and food items. Tested various reasoning modifiers including Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona.

Result: Integrating contextual metadata significantly reduces Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values compared to straightforward prompting with images alone.

Conclusion: Context-aware LMMs have significant potential for improved nutrition analysis, and intelligent metadata integration enhances the efficacy of reasoning modifiers.

Abstract: Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce \textbf{ACETADA}, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.

[462] Divergence Minimization Preference Optimization for Diffusion Model Alignment

Binxu Li, Minkai Xu, Jiaqi Han, Meihua Dang, Stefano Ermon

Main category: cs.CV

TL;DR: DMPO is a novel preference optimization method for diffusion models that minimizes reverse KL divergence, outperforming existing techniques in aligning models with human preferences.

DetailsMotivation: Existing preference optimization methods for diffusion models suffer from suboptimal mean-seeking optimization, and there's a need for more principled alignment approaches inspired by language model advancements.

Method: Divergence Minimization Preference Optimization (DMPO) - a method that aligns diffusion models by minimizing reverse KL divergence, which asymptotically matches the optimization direction of original reinforcement learning.

Result: DMPO consistently outperforms all baseline models across different base models and test sets, achieving the best PickScore in every case, demonstrating superior alignment with desired outputs.

Conclusion: DMPO provides a robust and elegant pathway for preference alignment in diffusion models, effectively bridging principled theory with practical performance improvements.

Abstract: Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and automatic metrics. Our extensive results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques, specifically consistently outperforming all baseline models across different base models and test sets, achieving the best PickScore in every case, demonstrating the method’s superiority in aligning generative behavior with desired outputs. Overall, DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.

[463] SurfDist: Interpretable Three-Dimensional Instance Segmentation Using Curved Surface Patches

Jackson Borchardt, Saul Kato

Main category: cs.CV

TL;DR: SurfDist is a 3D volumetric instance segmentation method using smooth parametric surface patches that outperforms StarDist-3D for blob-shaped biomedical instances with more compact parameterizations.

DetailsMotivation: To overcome StarDist-3D's limitation of coupling instance parameterization dimension with voxel resolution, and to enable high-resolution predictions without voxelization artifacts using smooth surface representations.

Method: Modified StarDist-3D architecture using convolutional neural networks to predict instances as closed surfaces composed of bicubic Bézier triangles, decoupling parameterization from resolution.

Result: SurfDist outperforms StarDist-3D on both synthetic and real-world datasets with blob-shaped instances, achieving better performance with more compact parameterizations.

Conclusion: Interpretable instance surface models can be effectively learned alongside instance membership, providing high-resolution segmentation without voxelization artifacts.

Abstract: We present SurfDist, a convolutional neural network architecture for three-dimensional volumetric instance segmentation. SurfDist enables prediction of instances represented as closed surfaces composed of smooth parametric surface patches, specifically bicubic B'ezier triangles. SurfDist is a modification of the popular model architecture StarDist-3D which breaks StarDist-3D’s coupling of instance parameterization dimension and instance voxel resolution, and it produces predictions which may be upsampled to arbitrarily high resolutions without introduction of voxelization artifacts. For datasets with blob-shaped instances, common in biomedical imaging, SurfDist can outperform StarDist-3D with more compact instance parameterizations. We detail SurfDist’s technical implementation and show one synthetic and one real-world dataset for which it outperforms StarDist-3D. These results demonstrate that interpretable instance surface models can be learned effectively alongside instance membership.

[464] UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

Peiran Wu, Yunze Liu, Zhengdong Zhu, Enmin Zhou, Junxiao Shen

Main category: cs.CV

TL;DR: UGC-VideoCap introduces a new benchmark and 3B parameter model for omnimodal video captioning of user-generated content, addressing the audio-visual imbalance in existing approaches through balanced multimodal integration and efficient training.

DetailsMotivation: Existing video captioning benchmarks and models are predominantly visual-centric, overlooking audio's crucial role in conveying scene dynamics, speaker intent, and narrative context in real-world user-generated videos like TikTok content.

Method: Created UGC-VideoCap benchmark with 1000 TikTok videos annotated through three-stage human-in-the-loop pipeline (audio-only, visual-only, joint audio-visual). Proposed UGC-VideoCaptioner(3B) model distilled from Gemini 2.5 Flash using two-stage training: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO).

Result: Developed a comprehensive benchmark with 4000 carefully crafted QA pairs probing unimodal and cross-modal understanding. The model enables efficient adaptation from limited data while maintaining competitive performance.

Conclusion: UGC-VideoCap provides a high-quality foundation and data-efficient solution for advancing omnimodal video captioning in unconstrained real-world user-generated content settings, addressing the audio-visual imbalance in current approaches.

Abstract: Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio visual content. However, existing video captioning benchmarks and models remain predominantly visual centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of omni datasets and lightweight, capable models hampers progress in fine grained, multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three stage human-in-the-loop pipeline covering audio only, visual only, and joint audio visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner(3B), a 3B parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy supervised fine tuning followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.

[465] DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization

Marzieh Gheisari, Auguste Genovesio

Main category: cs.CV

TL;DR: DiViD is the first end-to-end video diffusion framework that explicitly disentangles static appearance and dynamic motion in videos, outperforming existing VAE- and GAN-based approaches by preventing information leakage and improving reconstruction quality.

DetailsMotivation: Existing VAE- and GAN-based approaches for video disentanglement suffer from information leakage and blurry reconstructions, making unsupervised separation of static appearance and dynamic motion a fundamental challenge.

Method: DiViD uses a sequence encoder to extract global static tokens from the first frame and per-frame dynamic tokens, with a conditional DDPM decoder incorporating shared-noise schedules, time-varying KL-based bottlenecks, cross-attention mechanisms, and orthogonality regularization to prevent static-dynamic leakage.

Result: DiViD achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage compared to state-of-the-art sequential disentanglement methods on real-world benchmarks.

Conclusion: DiViD successfully demonstrates effective static-dynamic factorization in videos through its novel diffusion-based framework with explicit inductive biases and regularization techniques.

Abstract: Unsupervised disentanglement of static appearance and dynamic motion in video remains a fundamental challenge, often hindered by information leakage and blurry reconstructions in existing VAE- and GAN-based approaches. We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD’s sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. Its conditional DDPM decoder incorporates three key inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific. An orthogonality regularizer further prevents residual static-dynamic leakage. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics. DiViD outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.

[466] SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

Youngjin Na, Sangheon Jeong, Youngwan Lee, Jian Lee, Dawoon Jeong, Youngman Kim

Main category: cs.CV

TL;DR: SIA is a training-free safety framework that proactively detects harmful intent in multimodal inputs and guides safe response generation through visual abstraction, intent inference, and intent-conditioned generation.

DetailsMotivation: Existing multimodal safety approaches often fail to address latent risks where harmfulness arises from interactions between modalities in vision-language models.

Method: Three-stage process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, (3) intent-conditioned response generation.

Result: Extensive experiments on safety benchmarks (SIUO, MM-SafetyBench, HoliSafe) show SIA consistently improves safety and outperforms prior training-free methods.

Conclusion: SIA effectively mitigates harmful outputs in vision-language models by dynamically adapting to implicit intent without requiring extensive retraining.

Abstract: With the growing deployment of Vision-Language Models (VLMs) in real-world applications, previously overlooked safety risks are becoming increasingly evident. In particular, seemingly innocuous multimodal inputs can combine to reveal harmful intent, leading to unsafe model outputs. While multimodal safety has received increasing attention, existing approaches often fail to address such latent risks, especially when harmfulness arises only from the interaction between modalities. We propose SIA (Safety via Intent Awareness), a training-free, intent-aware safety framework that proactively detects harmful intent in multimodal inputs and uses it to guide the generation of safe responses. SIA follows a three-stage process: (1) visual abstraction via captioning; (2) intent inference through few-shot chain-of-thought (CoT) prompting; and (3) intent-conditioned response generation. By dynamically adapting to the implicit intent inferred from an image-text pair, SIA mitigates harmful outputs without extensive retraining. Extensive experiments on safety benchmarks, including SIUO, MM-SafetyBench, and HoliSafe, show that SIA consistently improves safety and outperforms prior training-free methods.

[467] SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and A Progressive Learning Strategy for Downstream Tasks

Yiguo He, Xinjun Cheng, Junjie Zhu, Chunping Qiu, Jun Wang, Xichuan Zhang, Qiangjuan Huang, Ke Yang

Main category: cs.CV

TL;DR: This paper introduces SAR-TEXT, a large-scale dataset of 130,000+ SAR image-text pairs, and demonstrates its effectiveness on three vision-language tasks through improved models that achieve significant performance gains.

DetailsMotivation: The lack of large-scale, high-quality SAR image-text datasets hinders semantic understanding of SAR imagery, despite its all-weather capability being essential in remote sensing.

Method: Constructed SAR-TEXT dataset using SAR-Narrator framework (multi-stage strategy for generating textual descriptions), then built three models: SAR-RS-CLIP for retrieval, SAR-RS-CoCa for captioning, and SAR-GPT for VQA.

Result: SAR-RS-CLIP improved average recall by 12.97% and 10.0% on test sets; SAR-RS-CoCa achieved significant improvements in captioning metrics; SAR-GPT outperformed baselines on VQA tasks with stronger semantic understanding.

Conclusion: SAR-TEXT dataset effectively advances SAR semantic understanding, and SAR-Narrator provides a flexible tool for the community to construct larger-scale datasets.

Abstract: Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-TEXT, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-TEXT dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 12.97% and 10.0% on the OSdataset_512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves significant improvements over the original CoCa models in terms of BLEU-4, SPICE, and CIDEr scores. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets. All code, pretrained models, and the SAR-Text dataset are publicly available at: https://github.com/YiguoHe/SAR-TEXT.

[468] RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters

Xiaolin Liu, Tianyi Zhou, Hongbo Kang, Jian Ma, Ziwen Wang, Jing Huang, Wenguo Weng, Yu-Kun Lai, Kun Li

Main category: cs.CV

TL;DR: A real-time 3D crowd evacuation simulation framework that integrates sensory-decision-motor flow with 3D-adaptive Social Force Model and personalized gait control for more realistic evacuation behaviors.

DetailsMotivation: Current evacuation models overlook complex human behaviors like collisions, interactions, terrain influences, and body shape variations, failing to accurately simulate real-world escape scenarios.

Method: Proposed framework integrates 3D-adaptive SFM Decision Mechanism and Personalized Gait Control Motor, enabling parallel agent movement with dynamic crowd awareness and part-level force visualization.

Result: Framework supports dynamic trajectory planning and personalized behavior for each agent, compatible with uneven terrain, generating more realistic and plausible evacuation results.

Conclusion: The method provides enhanced insights for crowd simulation with more realistic evacuation behaviors across various scenarios, with code publicly available.

Abstract: Crowd evacuation simulation is critical for enhancing public safety, and demanded for realistic virtual environments. Current mainstream evacuation models overlook the complex human behaviors that occur during evacuation, such as pedestrian collisions, interpersonal interactions, and variations in behavior influenced by terrain types or individual body shapes. This results in the failure to accurately simulate the escape of people in the real world. In this paper, aligned with the sensory-decision-motor (SDM) flow of the human brain, we propose a real-time 3D crowd evacuation simulation framework that integrates a 3D-adaptive SFM (Social Force Model) Decision Mechanism and a Personalized Gait Control Motor. This framework allows multiple agents to move in parallel and is suitable for various scenarios, with dynamic crowd awareness. Additionally, we introduce Part-level Force Visualization to assist in evacuation analysis. Experimental results demonstrate that our framework supports dynamic trajectory planning and personalized behavior for each agent throughout the evacuation process, and is compatible with uneven terrain. Visually, our method generates evacuation results that are more realistic and plausible, providing enhanced insights for crowd simulation. The code is available at http://cic.tju.edu.cn/faculty/likun/projects/RESCUE.

[469] ReMoMask: Retrieval-Augmented Masked Motion Generation

Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang

Main category: cs.CV

TL;DR: ReMoMask is a unified framework for text-to-motion generation that integrates bidirectional momentum modeling, semantic spatio-temporal attention, and RAG-classifier-free guidance to overcome limitations in both generative and retrieval-augmented approaches.

DetailsMotivation: Current text-to-motion generation methods face dual challenges: generative models suffer from limited diversity and physical implausibility, while retrieval-augmented methods exhibit diffusion inertia and asynchronous artifacts.

Method: ReMoMask integrates three innovations: 1) Bidirectional Momentum Text-Motion Model with momentum queues, 2) Semantic Spatio-temporal Attention for biomechanical constraints, 3) RAG-Classier-Free Guidance with minor unconditional generation.

Result: ReMoMask achieves state-of-the-art performance with 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML benchmarks respectively compared to previous SOTA method RAG-T2M.

Conclusion: ReMoMask effectively addresses key limitations in text-to-motion generation by unifying retrieval and generative approaches, producing temporally coherent motions with improved physical plausibility and semantic alignment.

Abstract: Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: Generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask’s RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.

[470] Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, Yue Ma

Main category: cs.CV

TL;DR: Follow-Your-Shape is a training-free framework for precise object shape editing that preserves background content using Trajectory Divergence Maps and Scheduled KV Injection.

DetailsMotivation: Existing flow-based image editing models struggle with large-scale shape transformations, often failing to achieve intended changes or degrading background quality.

Method: Uses Trajectory Divergence Maps to locate editable regions and Scheduled KV Injection for stable editing, without requiring training or masks.

Result: Achieves superior editability and visual fidelity in shape replacement tasks, outperforming existing methods.

Conclusion: The proposed framework enables precise shape editing while preserving non-target content, validated by the new ReShapeBench benchmark.

Abstract: While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios – particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. To facilitate a rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing. Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.

[471] Deep Spectral Epipolar Representations for Dense Light Field Reconstruction

Noor Islam S. Mohammad

Main category: cs.CV

TL;DR: A novel Deep Spectral Epipolar Representation (DSER) framework for dense light field depth reconstruction that combines spectral feature learning with epipolar-domain regularization to achieve high accuracy and efficiency.

DetailsMotivation: Existing deep convolutional approaches for dense depth reconstruction from light fields suffer from high computational overhead and sensitivity to noise and disparity inconsistencies in real-world scenarios.

Method: The DSER framework unifies deep spectral feature learning with epipolar-domain regularization, exploiting frequency-domain correlations across epipolar plane images to enforce global structural coherence and mitigate artifacts.

Result: Comprehensive experiments on the 4D Light Field Benchmark and real-world datasets show DSER achieves superior performance in precision, structural consistency, and computational efficiency compared to state-of-the-art methods.

Conclusion: DSER demonstrates the potential of integrating spectral priors with epipolar geometry for scalable and noise-resilient dense light field depth estimation, establishing it as a promising direction for next-generation high-dimensional vision systems.

Abstract: Accurate and efficient dense depth reconstruction from light field imagery remains a central challenge in computer vision, underpinning applications such as augmented reality, biomedical imaging, and 3D scene reconstruction. Existing deep convolutional approaches, while effective, often incur high computational overhead and are sensitive to noise and disparity inconsistencies in real-world scenarios. This paper introduces a novel Deep Spectral Epipolar Representation (DSER) framework for dense light field reconstruction, which unifies deep spectral feature learning with epipolar-domain regularization. The proposed approach exploits frequency-domain correlations across epipolar plane images to enforce global structural coherence, thereby mitigating artifacts and enhancing depth accuracy. Unlike conventional supervised models, DSER operates efficiently with limited training data while maintaining high reconstruction fidelity. Comprehensive experiments on the 4D Light Field Benchmark and a diverse set of real-world datasets demonstrate that DSER achieves superior performance in terms of precision, structural consistency, and computational efficiency compared to state-of-the-art methods. These results highlight the potential of integrating spectral priors with epipolar geometry for scalable and noise-resilient dense light field depth estimation, establishing DSER as a promising direction for next-generation high-dimensional vision systems.

[472] Event-driven Robust Fitting on Neuromorphic Hardware

Tam Ngoc-Bang Nguyen, Anh-Dzung Doan, Zhipeng Cai, Tat-Jun Chin

Main category: cs.CV

TL;DR: This paper presents a neuromorphic approach to robust geometric model fitting using spiking neural networks on Intel Loihi 2 hardware, achieving 85% energy reduction compared to standard CPU implementations.

DetailsMotivation: Energy efficiency has become critical for AI adoption, but robust fitting algorithms have received little attention in this aspect. The authors aim to address the growing concern of high energy consumption in computer vision pipelines.

Method: Developed a novel spiking neural network for robust fitting on Intel Loihi 2 neuromorphic hardware, with event-driven formulations of model estimation and algorithmic strategies to overcome hardware limitations in precision and instruction sets.

Result: The neuromorphic robust fitting implementation consumes only 15% of the energy required by established robust fitting algorithms on standard CPUs while achieving equivalent accuracy.

Conclusion: Neuromorphic computing offers a promising pathway for energy-efficient robust fitting in computer vision, demonstrating significant energy savings without compromising accuracy.

Abstract: Robust fitting of geometric models is a fundamental task in many computer vision pipelines. Numerous innovations have been produced on the topic, from improving the efficiency and accuracy of random sampling heuristics to generating novel theoretical insights that underpin new approaches with mathematical guarantees. However, one aspect of robust fitting that has received little attention is energy efficiency. This performance metric has become critical as high energy consumption is a growing concern for AI adoption. In this paper, we explore energy-efficient robust fitting via the neuromorphic computing paradigm. Specifically, we designed a novel spiking neural network for robust fitting on real neuromorphic hardware, the Intel Loihi 2. Enabling this are novel event-driven formulations of model estimation that allow robust fitting to be implemented in the unique architecture of Loihi 2, and algorithmic strategies to alleviate the current limited precision and instruction set of the hardware. Results show that our neuromorphic robust fitting consumes only a fraction (15%) of the energy required to run the established robust fitting algorithm on a standard CPU to equivalent accuracy.

[473] TSLA: A Task-Specific Learning Adaptation for Semantic Segmentation on Autonomous Vehicles Platform

Jun Liu, Zhenglun Kong, Pu Zhao, Weihao Zeng, Hao Tang, Xuan Shen, Changdi Yang, Wenbin Zhang, Geng Yuan, Wei Niu, Xue Lin, Yanzhi Wang

Main category: cs.CV

TL;DR: A dynamic adaptation framework for semantic segmentation networks in autonomous driving that uses three-tier control mechanisms and Bayesian Optimization to optimize model configurations for different hardware constraints and driving scenarios.

DetailsMotivation: Autonomous driving platforms face diverse scenarios with varying hardware resources and precision requirements, requiring cost-effective deployment on embedded devices like NVIDIA DRIVE PX 2.

Method: Three-tier control mechanism (width multiplier, classifier depth, classifier kernel) for fine-grained model adaptation, combined with Bayesian Optimization for efficient hyperparameter search under computational constraints.

Result: Enables broad model scaling, targeted layer refinement, and scenario-specific optimization, leading to improved resource allocation and performance through Task-Specific Learning Adaptation (TSLA).

Conclusion: The approach successfully customizes semantic segmentation networks for autonomous driving hardware, maximizing computational capacity and model accuracy while optimizing hardware utilization across diverse self-driving tasks.

Abstract: Autonomous driving platforms encounter diverse driving scenarios, each with varying hardware resources and precision requirements. Given the computational limitations of embedded devices, it is crucial to consider computing costs when deploying on target platforms like the NVIDIA\textsuperscript{\textregistered} DRIVE PX 2. Our objective is to customize the semantic segmentation network according to the computing power and specific scenarios of autonomous driving hardware. We implement dynamic adaptability through a three-tier control mechanism – width multiplier, classifier depth, and classifier kernel – allowing fine-grained control over model components based on hardware constraints and task requirements. This adaptability facilitates broad model scaling, targeted refinement of the final layers, and scenario-specific optimization of kernel sizes, leading to improved resource allocation and performance. Additionally, we leverage Bayesian Optimization with surrogate modeling to efficiently explore hyperparameter spaces under tight computational budgets. Our approach addresses scenario-specific and task-specific requirements through automatic parameter search, accommodating the unique computational complexity and accuracy needs of autonomous driving. It scales its Multiply-Accumulate Operations (MACs) for Task-Specific Learning Adaptation (TSLA), resulting in alternative configurations tailored to diverse self-driving tasks. These TSLA customizations maximize computational capacity and model accuracy, optimizing hardware utilization.

[474] The Telephone Game: Evaluating Semantic Drift in Unified Models

Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah

Main category: cs.CV

TL;DR: The paper introduces the Semantic Drift Protocol (SDP) to evaluate cross-modal consistency in unified visual language models, measuring how well models maintain semantic meaning when cycling between image-to-text and text-to-image tasks.

DetailsMotivation: Existing evaluation benchmarks assess visual understanding (I2T) and visual generation (T2I) in isolation, failing to reveal whether models can consistently maintain semantic meaning when alternating between modalities.

Method: Proposed Semantic Drift Protocol (SDP) - a cyclic evaluation protocol that alternates I2T and T2I over multiple generations. Introduces two metrics: Mean Cumulative Drift (MCD) for embedding-based semantic drift measurement, and Multi-Generation GenEval (MGG) for object-level compliance.

Result: Evaluation on seven models using Nocaps+Docci400 benchmark revealed substantial variation in cross-modal stability. Some models like BAGEL maintained semantic meaning over many alternations, while others like VILA-U drifted quickly despite strong single-pass scores.

Conclusion: SDP serves as a necessary complement to standard I2T and T2I evaluations, revealing cross-modal consistency issues that isolated single-pass metrics cannot detect.

Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME, MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that “understands” a concept can also “render” it, nor whether semantic meaning is preserved when cycling between image and text modalities. To address this, we introduce the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO dataset, which is widely used in training; we create a new benchmark Nocaps+Docci400, sampled from NoCaps and DOCCI and evaluated on seven recent models. SDP reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantic meaning over many alternations, whereas others like VILA-U drift quickly despite strong single-pass scores. Our results highlight SDP as a necessary complement to standard I2T and T2I evaluations. Code is available at https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models

[475] Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Yusuke Hirota, Ryo Hachiuma, Boyi Li, Ximing Lu, Michael Ross Boone, Boris Ivanovic, Yejin Choi, Marco Pavone, Yu-Chiang Frank Wang, Noa Garcia, Yuta Nakashima, Chao-Han Huck Yang

Main category: cs.CV

TL;DR: Current gender bias evaluations in VLMs are unreliable due to spurious correlations between gender and non-gender features like objects and backgrounds, which can dramatically alter bias scores with minimal perturbations.

DetailsMotivation: Gender bias in vision-language models raises safety concerns, but existing benchmarks contain spurious correlations that may distort bias evaluation results.

Method: Systematically perturb non-gender features across four benchmarks (COCO-gender, FACET, MIAP, PHASE) and various VLMs to quantify impact on bias evaluation.

Result: Minimal perturbations (10% object masking, weak background blurring) can dramatically alter bias scores by up to 175% in generative VLMs and 43% in CLIP variants.

Conclusion: Current bias evaluations reflect model responses to spurious features rather than true gender bias. Recommendations include reporting bias metrics alongside feature-sensitivity measurements for more reliable assessment.

Abstract: Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.

[476] Graph Algorithm Unrolling with Douglas-Rachford Iterations for Image Interpolation with Guaranteed Initialization

Xue Zhang, Bingshuo Hu, Gene Cheung

Main category: cs.CV

TL;DR: The paper proposes a novel approach to image interpolation by initializing a directed graph adjacency matrix based on a known interpolator, then learning perturbation matrices from data to enhance performance through unrolled Douglas-Rachford iterations, achieving state-of-the-art results with significantly fewer parameters.

DetailsMotivation: Conventional DNNs initialize parameters randomly and optimize via SGD, which risks poor local minima. The authors aim to develop a more reliable initialization and optimization strategy for image interpolation.

Method: Initialize directed graph adjacency matrix A using a known interpolator Θ, then learn perturbation matrices P and P(2) from data. Implement restoration effects via Douglas-Rachford iterations, which are unrolled into an interpretable neural network.

Result: Experimental results show state-of-the-art image interpolation performance while drastically reducing the number of network parameters compared to conventional approaches.

Conclusion: The proposed method successfully addresses the limitations of random initialization in DNNs by leveraging graph-based initialization and data-driven perturbations, achieving superior interpolation results with improved parameter efficiency.

Abstract: Conventional deep neural nets (DNNs) initialize network parameters at random and then optimize each one via stochastic gradient descent (SGD), resulting in substantial risk of poor-performing local minima.Focusing on the image interpolation problem and leveraging a recent theorem that maps a (pseudo-)linear interpolator {\Theta} to a directed graph filter that is a solution to a MAP problem regularized with a graph shift variation (GSV) prior, we first initialize a directed graph adjacency matrix A based on a known interpolator {\Theta}, establishing a baseline performance.Then, towards further gain, we learn perturbation matrices P and P(2) from data to augment A, whose restoration effects are implemented via Douglas-Rachford (DR) iterations, which we unroll into a lightweight interpretable neural net.Experimental results demonstrate state-of-the-art image interpolation results, while drastically reducing network parameters.

[477] Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark

Rashid Mushkani

Main category: cs.CV

TL;DR: A benchmark for testing vision-language models on urban perception using 100 Montreal street images (50 real photos, 50 synthetic), with human annotations across 30 dimensions. Evaluated 7 VLMs in zero-shot setup, showing better performance on objective properties than subjective appraisals.

DetailsMotivation: To understand how people read city scenes and inform urban design/planning by creating a benchmark for evaluating vision-language models on urban perception tasks.

Method: Used 100 Montreal street images (50 real, 50 synthetic) with 230 human annotations across 30 dimensions. Evaluated 7 VLMs in zero-shot setup with structured prompts and deterministic parser. Measured accuracy for single-choice items and Jaccard overlap for multi-label items.

Result: Models performed better on visible, objective properties than subjective appraisals. Claude-sonnet achieved macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human agreement correlated with better model scores. Synthetic images slightly lowered performance.

Conclusion: The benchmark enables reproducible, uncertainty-aware evaluation of VLMs for participatory urban analysis, showing current limitations in capturing subjective urban perceptions compared to objective properties.

Abstract: Understanding how people read city scenes can inform design and planning. We introduce a small benchmark for testing vision-language models (VLMs) on urban perception using 100 Montreal street images, evenly split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups supplied 230 annotation forms across 30 dimensions mixing physical attributes and subjective impressions. French responses were normalized to English. We evaluated seven VLMs in a zero-shot setup with a structured prompt and deterministic parser. We use accuracy for single-choice items and Jaccard overlap for multi-label items; human agreement uses Krippendorff’s alpha and pairwise Jaccard. Results suggest stronger model alignment on visible, objective properties than subjective appraisals. The top system (claude-sonnet) reaches macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human agreement coincides with better model scores. Synthetic images slightly lower scores. We release the benchmark, prompts, and harness for reproducible, uncertainty-aware evaluation in participatory urban analysis.

[478] Latent Visual Reasoning

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu

Main category: cs.CV

TL;DR: LVR enables autoregressive reasoning directly in visual embedding space, improving perception-intensive VQA tasks by interleaving latent visual reasoning with text generation.

DetailsMotivation: Current MLLMs with CoT reasoning are constrained by treating visual information as static and confining reasoning to language space only.

Method: Projects images into visual tokens in joint semantic space, trains language model to generate latent states that reconstruct key visual tokens, and adapts GRPO algorithm for reinforcement learning on latent reasoning.

Result: Achieves 71.67% on MMVP benchmark compared to 66.67% with Qwen2.5-VL, showing substantial gains in fine-grained visual understanding.

Conclusion: LVR paradigm enables direct visual reasoning in embedding space, significantly enhancing perception capabilities of multimodal models.

Abstract: Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.

[479] Smaller is Better: Enhancing Transparency in Vehicle AI Systems via Pruning

Sanish Suwal, Shaurya Garg, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi

Main category: cs.CV

TL;DR: Pruning significantly improves the quality and reliability of post-hoc explanations for traffic sign classifiers, making them more comprehensible and faithful compared to natural and adversarial training approaches.

DetailsMotivation: Connected and autonomous vehicles rely on AI systems where transparency and security are critical. Post-hoc explanations for black-box AI models often suffer from inconsistencies and lack of faithfulness, questioning their reliability.

Method: Systematically examined three training approaches (natural training, adversarial training, and pruning) on traffic sign classifiers, evaluating their impact on post-hoc explanation quality using saliency maps through extensive empirical evaluation.

Result: Pruning significantly enhances comprehensibility and faithfulness of explanations. It enforces sparsity in learned representation, leading to more interpretable and reliable decisions while improving model efficiency.

Conclusion: Pruning is a promising strategy for developing transparent deep learning models, especially beneficial for resource-constrained vehicular AI systems where both efficiency and interpretability are crucial.

Abstract: Connected and autonomous vehicles continue to heavily rely on AI systems, where transparency and security are critical for trust and operational safety. Post-hoc explanations provide transparency to these black-box like AI models but the quality and reliability of these explanations is often questioned due to inconsistencies and lack of faithfulness in representing model decisions. This paper systematically examines the impact of three widely used training approaches, namely natural training, adversarial training, and pruning, affect the quality of post-hoc explanations for traffic sign classifiers. Through extensive empirical evaluation, we demonstrate that pruning significantly enhances the comprehensibility and faithfulness of explanations (using saliency maps). Our findings reveal that pruning not only improves model efficiency but also enforces sparsity in learned representation, leading to more interpretable and reliable decisions. Additionally, these insights suggest that pruning is a promising strategy for developing transparent deep learning models, especially in resource-constrained vehicular AI systems.

[480] A Comparative Benchmark of Real-time Detectors for Blueberry Detection towards Precision Orchard Management

Xinyang Mu, Yuzhen Lu, Boyang Deng

Main category: cs.CV

TL;DR: Comparative benchmark analysis of 36 real-time object detectors (YOLO v8-v12 and RT-DETR v1-v2) for blueberry detection, evaluated on a new dataset of 85,879 labeled instances and enhanced with semi-supervised learning.

DetailsMotivation: Blueberry detection in natural environments is challenging due to variable lighting, occlusions, and motion blur. Deep learning detectors need large, diverse datasets and proper accuracy/speed/memory trade-offs for practical deployment.

Method: Evaluated 36 model variants on a curated dataset of 661 canopy images with 85,879 labeled blueberry instances. Fine-tuned models using Unbiased Mean Teacher-based semi-supervised learning on 1,035 unlabeled images.

Result: YOLOv12m achieved 93.3% mAP@50, RT-DETRv2-X achieved 93.6% mAP@50. SSL fine-tuning improved accuracy by up to 2.9%, with RT-DETR-v2-X reaching 94.8% mAP@50. Mid-sized models offered best accuracy-speed balance.

Conclusion: RT-DETR-v2-X achieved the highest accuracy after SSL fine-tuning. Semi-supervised learning shows promise but needs more research for cross-domain data. Dataset and software are publicly available for further research.

Abstract: Blueberry detection in natural environments remains challenging due to variable lighting, occlusions, and motion blur due to environmental factors and imaging devices. Deep learning-based object detectors promise to address these challenges, but they demand a large-scale, diverse dataset that captures the real-world complexities. Moreover, deploying these models in practical scenarios often requires the right accuracy/speed/memory trade-off in model selection. This study presents a novel comparative benchmark analysis of advanced real-time object detectors, including YOLO (You Only Look Once) (v8-v12) and RT-DETR (Real-Time Detection Transformers) (v1-v2) families, consisting of 36 model variants, evaluated on a newly curated dataset for blueberry detection. This dataset comprises 661 canopy images collected with smartphones during the 2022-2023 seasons, consisting of 85,879 labelled instances (including 36,256 ripe and 49,623 unripe blueberries) across a wide range of lighting conditions, occlusions, and fruit maturity stages. Among the YOLO models, YOLOv12m achieved the best accuracy with a mAP@50 of 93.3%, while RT-DETRv2-X obtained a mAP@50 of 93.6%, the highest among all the RT-DETR variants. The inference time varied with the model scale and complexity, and the mid-sized models appeared to offer a good accuracy-speed balance. To further enhance detection performance, all the models were fine-tuned using Unbiased Mean Teacher-based semi-supervised learning (SSL) on a separate set of 1,035 unlabeled images acquired by a ground-based machine vision platform in 2024. This resulted in accuracy gains ranging from -1.4% to 2.9%, with RT-DETR-v2-X achieving the best mAP@50 of 94.8%. More in-depth research into SSL is needed to better leverage cross-domain unlabeled data. Both the dataset and software programs of this study are made publicly available to support further research.

[481] Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation

Farbod Bigdeli, Mohsen Mohammadagha, Ali Bigdeli

Main category: cs.CV

TL;DR: ROI augmentation improves mammography classification in constrained datasets without extra labels or model changes.

DetailsMotivation: Deep learning for mammogram interpretation is limited by small datasets and low resolution. ROI augmentation addresses these constraints.

Method: Lightweight ROI augmentation: probabilistically replace full images with random ROI crops from precomputed bounding-box bank during training.

Result: Modest ROC-AUC gains with best parameters (p_roi=0.10, alpha=0.10), variable performance across folds, flat/slightly lower PR-AUC.

Conclusion: Simple ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural changes.

Abstract: Breast cancer screening with mammography remains central to early detection and mortality reduction. Deep learning has shown strong potential for automating mammogram interpretation, yet limited-resolution datasets and small sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset (9,684 images; 2,414 patients) and introduce a lightweight region-of-interest (ROI) augmentation strategy. During training, full images are probabilistically replaced with random ROI crops sampled from a precomputed, label-free bounding-box bank, with optional jitter to increase variability. We evaluate under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and training-time efficiency metrics (throughput and GPU memory). Because ROI augmentation is training-only, inference-time cost remains unchanged. On Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to slightly lower. These results demonstrate that simple, data-centric ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural modifications.

[482] Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence

Sanish Suwal, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi

Main category: cs.CV

TL;DR: This paper investigates how magnitude-based pruning affects neural network interpretability, showing that light-to-moderate pruning improves saliency map quality and concept coherence, while aggressive pruning reduces interpretability despite maintaining accuracy.

DetailsMotivation: To understand how pruning impacts model interpretability, as prior work focused on performance preservation but neglected interpretability effects.

Method: Used ResNet-18 on ImageNette, applied magnitude-based pruning followed by fine-tuning, evaluated with Vanilla Gradients and Integrated Gradients for saliency maps, and CRAFT-based concept extraction for semantic coherence analysis.

Result: Light-to-moderate pruning improved saliency-map focus and faithfulness while maintaining distinct concepts. Aggressive pruning merged features, reducing sparsity and concept coherence despite preserved accuracy.

Conclusion: Pruning can shape representations toward human-aligned attention patterns, but excessive pruning undermines interpretability, suggesting a trade-off between model compression and interpretability.

Abstract: Prior works have shown that neural networks can be heavily pruned while preserving performance, but the impact of pruning on model interpretability remains unclear. In this work, we investigate how magnitude-based pruning followed by fine-tuning affects both low-level saliency maps and high-level concept representations. Using a ResNet-18 trained on ImageNette, we compare post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) across pruning levels, evaluating sparsity and faithfulness. We further apply CRAFT-based concept extraction to track changes in semantic coherence of learned concepts. Our results show that light-to-moderate pruning improves saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. In contrast, aggressive pruning merges heterogeneous features, reducing saliency map sparsity and concept coherence despite maintaining accuracy. These findings suggest that while pruning can shape internal representations toward more human-aligned attention patterns, excessive pruning undermines interpretability.

[483] Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance

Zixuan Wang, Yu Sun, Hongwei Wang, Baoyu Jing, Xiang Shen, Xin Dong, Zhuolin Hao, Hongyu Xiong, Yang Song

Main category: cs.CV

TL;DR: A reasoning-enhanced MLLM pretraining paradigm for unified inappropriate content detection in short videos, using three targeted tasks to bridge distribution gaps and improve generalization.

DetailsMotivation: Existing approaches require separate models for each content issue type, needing extensive human-labeled data and lacking cross-issue generalization capabilities.

Method: Three pretraining tasks: Caption (enhance video detail perception), VQA (deepen understanding of issue definitions), and Chain-of-Thought (improve reasoning capability).

Result: Significantly improves MLLM performance in both zero-shot and supervised fine-tuning settings, with strong generalization to emergent, unseen issues.

Conclusion: The proposed pretraining paradigm effectively addresses distribution gaps and complex issue definitions, enabling unified inappropriate content detection with improved generalization.

Abstract: Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate and small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) \textit{Caption}, to enhance the MLLM’s perception of video details; (2) \textit{Visual Question Answering (VQA)}, to deepen the MLLM’s understanding of issue definitions and annotation guidelines; (3) \textit{Chain-of-Thought (CoT)}, to enhance the MLLM’s reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM’s performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.

[484] A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation

Jiaping Yu, Muli Yang, Jiapeng Ji, Jiexi Yan, Cheng Deng

Main category: cs.CV

TL;DR: EXCL proposes a source-free unsupervised domain adaptation method using dual experts (source model + vision-language model) with retrieval-augmented interaction to adapt models without source data access.

DetailsMotivation: Address privacy and cost concerns in domain adaptation by eliminating need for source data access, while overcoming limitations of existing methods that neglect complementary insights and target data structure.

Method: Dual Experts framework with frozen source model (Conv-Adapter) and pretrained vision-language model (trainable text prompt), plus three-stage RAIN pipeline: collaborative retrieval, separate fine-tuning, and learning object consistency enforcement.

Result: Extensive experiments on four benchmark datasets demonstrate state-of-the-art matching performance.

Conclusion: EXCL effectively adapts models to target domains without source data by leveraging dual experts and retrieval-augmented interaction, achieving competitive performance while addressing privacy and cost constraints.

Abstract: Source-Free Unsupervised Domain Adaptation (SFUDA) addresses the realistic challenge of adapting a source-trained model to a target domain without access to the source data, driven by concerns over privacy and cost. Existing SFUDA methods either exploit only the source model’s predictions or fine-tune large multimodal models, yet both neglect complementary insights and the latent structure of target data. In this paper, we propose the Experts Cooperative Learning (EXCL). EXCL contains the Dual Experts framework and Retrieval-Augmentation-Interaction optimization pipeline. The Dual Experts framework places a frozen source-domain model (augmented with Conv-Adapter) and a pretrained vision-language model (with a trainable text prompt) on equal footing to mine consensus knowledge from unlabeled target samples. To effectively train these plug-in modules under purely unsupervised conditions, we introduce Retrieval-Augmented-Interaction(RAIN), a three-stage pipeline that (1) collaboratively retrieves pseudo-source and complex target samples, (2) separately fine-tunes each expert on its respective sample set, and (3) enforces learning object consistency via a shared learning result. Extensive experiments on four benchmark datasets demonstrate that our approach matches state-of-the-art performance.

Xiafeng Man, Zhipeng Wei, Jingjing Chen

Main category: cs.CV

TL;DR: D-Plus-Minus (DPM) is a novel framework that detects copyright infringement in diffusion models by measuring output deviations when including/excluding training data, using differential privacy principles and fine-tuning simulations.

DetailsMotivation: Address legal and ethical concerns about large vision models memorizing copyrighted content, and overcome limitations of existing detection methods that lack robustness and theoretical foundations.

Method: Formalize copyright infringement from Differential Privacy perspective, introduce conditional sensitivity metric, simulate inclusion/exclusion via fine-tuning (learning/unlearning), compute confidence scores over orthogonal prompt distributions using statistical metrics.

Result: DPM reliably detects infringement content without requiring access to original training data or text prompts, demonstrating effectiveness across diverse categories in the Copyright Infringement Detection Dataset (CIDD).

Conclusion: DPM offers an interpretable and practical solution for safeguarding intellectual property in generative AI, providing robust copyright infringement detection with theoretical underpinnings.

Abstract: The widespread deployment of large vision models such as Stable Diffusion raises significant legal and ethical concerns, as these models can memorize and reproduce copyrighted content without authorization. Existing detection approaches often lack robustness and fail to provide rigorous theoretical underpinnings. To address these gaps, we formalize the concept of copyright infringement and its detection from the perspective of Differential Privacy (DP), and introduce the conditional sensitivity metric, a concept analogous to sensitivity in DP, that quantifies the deviation in a diffusion model’s output caused by the inclusion or exclusion of a specific training data point. To operationalize this metric, we propose D-Plus-Minus (DPM), a novel post-hoc detection framework that identifies copyright infringement in text-to-image diffusion models. Specifically, DPM simulates inclusion and exclusion processes by fine-tuning models in two opposing directions: learning or unlearning. Besides, to disentangle concept-specific influence from the global parameter shifts induced by fine-tuning, DPM computes confidence scores over orthogonal prompt distributions using statistical metrics. Moreover, to facilitate standardized benchmarking, we also construct the Copyright Infringement Detection Dataset (CIDD), a comprehensive resource for evaluating detection across diverse categories. Our results demonstrate that DPM reliably detects infringement content without requiring access to the original training dataset or text prompts, offering an interpretable and practical solution for safeguarding intellectual property in the era of generative AI.

[486] Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis

Ruilang Wang, Shuotong Xu, Bowen Liu, Runlin Huang, Donglong Chen, Weifeng Su

Main category: cs.CV

TL;DR: Mask What Matters is a text-guided masking framework for self-supervised medical image analysis that uses vision-language models to selectively mask diagnostically relevant regions, achieving better performance with lower masking ratios than existing methods.

DetailsMotivation: Address the challenges of data scarcity in medical imaging and improve upon existing self-supervised masked image modeling approaches that rely on random high-ratio masking, which leads to inefficiency and poor semantic alignment.

Method: Leverages vision-language models for prompt-based region localization to apply differentiated masking that emphasizes diagnostically relevant regions while reducing redundancy in background areas, enabling controllable text-guided masking.

Result: Outperforms existing MIM methods across multiple medical imaging modalities (brain MRI, chest CT, lung X-ray) with gains of up to +3.1 percentage points in classification accuracy, +1.3 in BoxAP, and +1.1 in MaskAP, while using substantially lower masking ratios (40% vs 70%).

Conclusion: Controllable, text-driven masking enables semantically aligned self-supervised learning and advances robust vision models for medical image analysis.

Abstract: The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40% vs. 70%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.

[487] OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting

Atakan Topaloglu, Kunyi Li, Michael Niemeyer, Nassir Navab, A. Murat Tekalp, Federico Tombari

Main category: cs.CV

TL;DR: OracleGS is a framework that combines generative completeness with regressive fidelity for sparse-view novel view synthesis by using a propose-and-validate approach with 3D-aware diffusion models and multi-view stereo validation.

DetailsMotivation: Sparse-view novel view synthesis suffers from geometric ambiguity, creating a trade-off between geometrically faithful but incomplete regressive models and complete but structurally inconsistent generative models.

Method: Uses a propose-and-validate framework: first synthesizes novel views with a 3D-aware diffusion model to propose complete scenes, then validates 3D uncertainties using MVS model attention maps to guide Gaussian Splatting optimization with uncertainty-weighted loss.

Result: Outperforms state-of-the-art methods on Mip-NeRF 360 and NeRF Synthetic datasets by filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions.

Conclusion: OracleGS successfully reconciles generative completeness with regressive fidelity by conditioning generative priors on multi-view geometric evidence, achieving superior sparse-view novel view synthesis.

Abstract: Sparse-view novel view synthesis is fundamentally ill-posed due to severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our “propose-and-validate” framework first leverages a pre-trained 3D-aware diffusion model to synthesize novel views to propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of generated views, using its attention maps to reveal regions where the generated views are well-supported by multi-view evidence versus where they fall into regions of high uncertainty due to occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, outperforming state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.

[488] EfficientMIL: Efficient Linear-Complexity MIL Method for WSI Classification

Chengying She, Chengwei Chen, Dongjie Fan, Lizhuang Liu, Chengwei Shao, Yun Bian, Ben Wang, Xinran Zhang

Main category: cs.CV

TL;DR: EfficientMIL is a linear-complexity multiple instance learning approach for whole slide image classification that replaces quadratic self-attention with efficient sequence models (GRU, LSTM, Mamba) and introduces Adaptive Patch Selector for patches selection, achieving superior performance with significant computational efficiency gains.

DetailsMotivation: Current state-of-the-art MIL methods for whole slide image classification rely on attention mechanisms with quadratic complexity, requiring substantial computational resources when processing hundreds of thousands of patches, creating a computational bottleneck.

Method: Proposed EfficientMIL with Adaptive Patch Selector (APS) module that replaces quadratic-complexity self-attention mechanisms in Transformer-based MIL methods with efficient linear-complexity sequence models including RNN-based GRU, LSTM, and State Space Model Mamba.

Result: Achieved AUC of 0.976 and accuracy of 0.933 on TCGA-Lung dataset with EfficientMIL-Mamba, and AUC of 0.990 and accuracy of 0.975 on CAMELYON16 dataset with EfficientMIL-GRU, surpassing previous state-of-the-art methods. APS proved more effective for patches selection than conventional strategies.

Conclusion: EfficientMIL provides a computationally efficient solution for whole slide image classification that outperforms existing methods while significantly reducing computational complexity from quadratic to linear, making it suitable for large-scale pathology applications.

Abstract: Whole slide images (WSIs) classification represents a fundamental challenge in computational pathology, where multiple instance learning (MIL) has emerged as the dominant paradigm. Current state-of-the-art (SOTA) MIL methods rely on attention mechanisms, achieving good performance but requiring substantial computational resources due to quadratic complexity when processing hundreds of thousands of patches. To address this computational bottleneck, we introduce EfficientMIL, a novel linear-complexity MIL approach for WSIs classification with the patches selection module Adaptive Patch Selector (APS) that we designed, replacing the quadratic-complexity self-attention mechanisms in Transformer-based MIL methods with efficient sequence models including RNN-based GRU, LSTM, and State Space Model (SSM) Mamba. EfficientMIL achieves significant computational efficiency improvements while outperforming other MIL methods across multiple histopathology datasets. On TCGA-Lung dataset, EfficientMIL-Mamba achieved AUC of 0.976 and accuracy of 0.933, while on CAMELYON16 dataset, EfficientMIL-GRU achieved AUC of 0.990 and accuracy of 0.975, surpassing previous state-of-the-art methods. Extensive experiments demonstrate that APS is also more effective for patches selection than conventional selection strategies.

[489] AutoPrune: Each Complexity Deserves a Pruning Policy

Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang

Main category: cs.CV

TL;DR: AutoPrune is a training-free framework that adaptively prunes visual tokens in vision-language models based on sample and task complexity, achieving 89% token reduction and 76.8% FLOPs reduction while maintaining 96.7% accuracy.

DetailsMotivation: Existing pruning methods use fixed schedules that don't align with the model's reasoning trajectory, failing to accommodate diverse input complexities. Human visual processing starts broad then narrows focus, suggesting adaptive pruning is needed.

Method: AutoPrune quantifies mutual information between visual and textual tokens, then projects this signal to budget-constrained logistic retention curves that adapt to different task complexities while maintaining computational constraints.

Result: Applied to LLaVA-1.5-7B, AutoPrune prunes 89% of visual tokens, reduces inference FLOPs by 76.8%, and retains 96.7% of original accuracy across tasks - a 9.1% improvement over PDrop.

Conclusion: Complexity-adaptive pruning effectively reduces computational demands while maintaining performance, demonstrating that adaptive strategies outperform fixed pruning schedules for vision-language models.

Abstract: The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies where, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model’s holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal to a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of different tasks and can guarantee adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating the effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.

[490] FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, Yujun Cai

Main category: cs.CV

TL;DR: FrameMind is a reinforcement learning framework that enables video understanding models to dynamically request visual information during reasoning, outperforming traditional fixed-frame sampling approaches.

DetailsMotivation: Current video understanding models use fixed frame sampling strategies that don't adapt to specific reasoning requirements, limiting performance on tasks needing either broad temporal coverage or fine-grained spatial detail.

Method: FrameMind uses Frame-Interleaved Chain-of-Thought (FiCOT) for multi-turn reasoning alternating between textual reasoning and active visual perception, with Dynamic Resolution Frame Sampling (DRFS) and DRFS-GRPO policy optimization algorithm trained without frame-level annotations.

Result: Extensive experiments on MLVU and VideoMME benchmarks show FrameMind significantly outperforms existing models and advances state-of-the-art in flexible and efficient video understanding.

Conclusion: The proposed dynamic sampling approach enables adaptive visual evidence gathering, overcoming limitations of static frame sampling in video understanding tasks.

Abstract: Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require either broad temporal coverage or fine-grained spatial detail. In this paper, we introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps. To train effective dynamic sampling policies, we propose Dynamic Resolution Frame Sampling (DRFS), which exposes models to diverse temporal-spatial trade-offs during learning, and DRFS-GRPO, a group-relative policy optimization algorithm that learns from outcome-based rewards without requiring frame-level annotations. Extensive experiments on challenging benchmarks like MLVU and VideoMME demonstrate that our method significantly outperforms existing models, advancing the state of the art in flexible and efficient video understanding.

[491] Uncovering Grounding IDs: How External Cues Shape Multi-Modal Binding

Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah

Main category: cs.CV

TL;DR: Grounding IDs are latent identifiers that emerge when external visual structures are added to LVLMs, improving multimodal binding by reducing modality gaps and strengthening attention between related components.

DetailsMotivation: To understand why adding simple visual structures like partitions and annotations improves LVLM performance, and to uncover the internal mechanisms behind these gains.

Method: Used representation analysis to study embedding space alignment and conducted causal interventions to verify that Grounding IDs mediate binding between objects and symbolic cues.

Result: Found that Grounding IDs emerge as robust within-partition alignment, reduce modality gap between image and text, strengthen attention between related components, improve cross-modal grounding, and reduce hallucinations.

Conclusion: Grounding IDs are a key symbolic mechanism that explains how external cues enhance multimodal binding, offering both interpretability and practical improvements in robustness.

Abstract: Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as robust within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism explaining how external cues enhance multimodal binding, offering both interpretability and practical improvements in robustness.

[492] Towards Foundation Models for Cryo-ET Subtomogram Analysis

Runmin Jiang, Wanyue Feng, Yuntian Yang, Shriya Pingulkar, Hong Wang, Xi Xiao, Xiaoyu Cao, Genpei Zhang, Xiao Wang, Xiaolong Wu, Tianyang Wang, Yang Liu, Xingjian Li, Min Xu

Main category: cs.CV

TL;DR: This paper introduces the first foundation model for cryo-ET subtomogram analysis, addressing challenges of scarce annotations, severe noise, and poor generalization through large-scale synthetic data generation, adaptive phase tokenization, and noise-resilient contrastive learning.

DetailsMotivation: Cryo-ET enables in situ visualization of macromolecular structures, but effective analysis is hindered by scarce annotations, severe noise, and poor generalization in subtomogram classification, alignment, and averaging tasks.

Method: Three key components: 1) CryoEngine - large-scale synthetic data generator producing 904k subtomograms from 452 particle classes; 2) APT-ViT - Adaptive Phase Tokenization-enhanced Vision Transformer with equivariance-enhancing module; 3) NRCL - Noise-Resilient Contrastive Learning strategy for stable representation learning under noise.

Result: State-of-the-art performance across 24 synthetic and real datasets on all three major subtomogram tasks (classification, alignment, averaging) with strong generalization to unseen datasets.

Conclusion: The proposed foundation model advances scalable and robust subtomogram analysis in cryo-ET by addressing key challenges through synthetic data generation, adaptive architecture design, and noise-resilient learning strategies.

Abstract: Cryo-electron tomography (cryo-ET) enables in situ visualization of macromolecular structures, where subtomogram analysis tasks such as classification, alignment, and averaging are critical for structural determination. However, effective analysis is hindered by scarce annotations, severe noise, and poor generalization. To address these challenges, we take the first step towards foundation models for cryo-ET subtomograms. First, we introduce CryoEngine, a large-scale synthetic data generator that produces over 904k subtomograms from 452 particle classes for pretraining. Second, we design an Adaptive Phase Tokenization-enhanced Vision Transformer (APT-ViT), which incorporates adaptive phase tokenization as an equivariance-enhancing module that improves robustness to both geometric and semantic variations. Third, we introduce a Noise-Resilient Contrastive Learning (NRCL) strategy to stabilize representation learning under severe noise conditions. Evaluations across 24 synthetic and real datasets demonstrate state-of-the-art (SOTA) performance on all three major subtomogram tasks and strong generalization to unseen datasets, advancing scalable and robust subtomogram analysis in cryo-ET.

[493] ExGS: Extreme 3D Gaussian Compression with Diffusion Priors

Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun

Main category: cs.CV

TL;DR: ExGS is a feed-forward framework for extreme 3D Gaussian Splatting compression that achieves over 100x compression while preserving rendering quality through Universal Gaussian Compression and GaussPainter with diffusion priors.

DetailsMotivation: Neural scene representations like 3DGS have high storage and transmission costs that hinder deployment in resource-constrained environments. Existing compression methods either require costly optimization or degrade quality under high compression ratios.

Method: ExGS unifies Universal Gaussian Compression (UGC) for re-optimization-free pruning to reduce Gaussian primitives, and GaussPainter which uses diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned scenes.

Result: The framework achieves over 100x compression (reducing 354.77 MB to 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. It enables real-time restoration with lightweight VAE and one-step diffusion design.

Conclusion: Diffusion priors play a central role in bridging the gap between extreme compression and high-quality neural rendering, making ExGS a practical solution for resource-constrained environments.

Abstract: Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality.We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings.To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over 100X compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering.Our code repository will be released at: https://github.com/chenttt2001/ExGS

[494] Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang

Main category: cs.CV

TL;DR: EvoQuality is a self-supervised framework that enables vision-language models to autonomously improve their image quality assessment capabilities without ground-truth labels, using self-consistency principles and iterative refinement.

DetailsMotivation: Traditional methods for improving vision-language models require costly human-annotated data. Self-supervised techniques like self-consistency have been effective for reasoning tasks but remain unexplored for perceptual domains like image quality assessment.

Method: EvoQuality adapts self-consistency to ranking-based IQA by generating pseudo-labels through pairwise majority voting on the VLM’s outputs. These pseudo-rankings create a fidelity reward that guides iterative evolution using group relative policy optimization (GRPO).

Result: EvoQuality boosts the base VLM’s zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. It achieves competitive or superior performance to state-of-the-art supervised VLM-based IQA models, outperforming them on 5 out of 7 benchmarks.

Conclusion: EvoQuality demonstrates that vision-language models can autonomously refine their perceptual capabilities through self-supervised learning, achieving remarkable performance without ground-truth labels and competing with supervised methods.

Abstract: Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques such as self-consistency have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM’s own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model’s iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM’s perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM’s zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks.

[495] PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents

Zikang Liu, Junyi Li, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-rong Wen

Main category: cs.CV

TL;DR: PAL-UI is a novel framework that enables GUI agents to adaptively retrieve past observations during long-horizon tasks, addressing memory limitations in multimodal GUI interaction.

DetailsMotivation: Existing GUI agents struggle with long-horizon tasks due to memory limitations, either truncating history or using simple textual summaries that risk losing critical visual information needed for future decisions.

Method: PAL-UI combines dual-level summarization (capturing observation-level cues and action-level outcomes) with a dedicated retrieval tool that allows agents to recall specific historical screenshots during planning. Models are trained on 8.6K mobile GUI navigation samples based on Qwen2.5-VL.

Result: PAL-UI significantly outperforms baseline models and prior methods in mobile GUI navigation tasks, even under data-efficient settings, and exhibits strong cross-domain generalization with notable improvements in web navigation without additional training.

Conclusion: The work demonstrates the potential of active memory retrieval for enhancing long-horizon planning capabilities of vision-based GUI agents.

Abstract: Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) promise human-like interaction with software applications, yet long-horizon tasks remain challenging due to memory limitations. Existing approaches either truncate history or rely on simple textual summaries, which risk losing critical information when past visual details become necessary for future decisions. In this paper, we propose \textbf{PAL-UI} (\textbf{P}lanning with \textbf{A}ctive \textbf{L}ook-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool that allows the agent to recall specific historical screenshots during planning. We curate a step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories and train \textbf{PAL-UI-3B} and \textbf{PAL-UI-7B} models based on Qwen2.5-VL. Extensive experiments demonstrate that PAL-UI significantly outperforms baseline models and prior methods in mobile GUI navigation tasks, even under data-efficient settings. Moreover, PAL-UI exhibits strong cross-domain generalization, achieving notable improvements in web navigation without additional training. Our work highlights the potential of active memory retrieval for long-horizon planning capabilities of vision-based GUI agents.

[496] Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Nanxiang Jiang, Zhaoxin Fan, Enhan Kang, Daiheng Gao, Yun Zhou, Yanxia Chang, Zheng Zhu, Yeying Jin, Wenjun Wu

Main category: cs.CV

TL;DR: ReFlux is a concept attack method designed to test the robustness of concept erasure in rectified flow-based text-to-image models like Flux, using reverse-attention optimization and velocity guidance to reactivate suppressed concepts.

DetailsMotivation: Existing concept erasure methods and attack evaluations are tailored for Stable Diffusion and show limited effectiveness when transferred to next-generation rectified flow transformers like Flux, creating a need for specialized assessment tools.

Method: The approach uses reverse-attention optimization to reactivate suppressed signals while stabilizing attention, reinforced by velocity-guided dynamics for robust concept reactivation and consistency-preserving objectives to maintain global layout.

Result: Extensive experiments demonstrate the method’s effectiveness and efficiency in attacking concept erasure in Flux models, establishing a reliable benchmark for evaluating erasure robustness.

Conclusion: ReFlux provides the first specialized attack method for assessing concept erasure robustness in rectified flow transformers, revealing vulnerabilities in current erasure techniques and setting a benchmark for future safety evaluations.

Abstract: Recent advances in text-to-image (T2I) diffusion models have enabled impressive generative capabilities, but they also raise significant safety concerns due to the potential to produce harmful or undesirable content. While concept erasure has been explored as a mitigation strategy, most existing approaches and corresponding attack evaluations are tailored to Stable Diffusion (SD) and exhibit limited effectiveness when transferred to next-generation rectified flow transformers such as Flux. In this work, we present ReFlux, the first concept attack method specifically designed to assess the robustness of concept erasure in the latest rectified flow-based T2I framework. Our approach is motivated by the observation that existing concept erasure techniques, when applied to Flux, fundamentally rely on a phenomenon known as attention localization. Building on this insight, we propose a simple yet effective attack strategy that specifically targets this property. At its core, a reverse-attention optimization strategy is introduced to effectively reactivate suppressed signals while stabilizing attention. This is further reinforced by a velocity-guided dynamic that enhances the robustness of concept reactivation by steering the flow matching process, and a consistency-preserving objective that maintains the global layout and preserves unrelated content. Extensive experiments consistently demonstrate the effectiveness and efficiency of the proposed attack method, establishing a reliable benchmark for evaluating the robustness of concept erasure strategies in rectified flow transformers.

[497] VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming

Duy Nguyen, Dat Nguyen

Main category: cs.CV

TL;DR: VirDA is a parameter-efficient UDA method that uses visual reprogramming layers to adapt input images to target domains instead of fine-tuning backbone parameters, achieving competitive accuracy with significantly fewer trainable parameters.

DetailsMotivation: Existing UDA methods require fine-tuning backbone parameters for each new source-target pair, leading to linear growth in parameters and storage, and preventing backbone reuse.

Method: VirDA prepends domain-specific visual reprogramming layers to the backbone that produce visual prompts acting as textural bias to adapt input images to target domains, using multiple objective functions to optimize intra- and inter-domain distribution differences without modifying backbone parameters.

Result: On Office-31, VirDA achieves 92.8% mean accuracy with only 1.5M trainable parameters, surpassing PDA by +1.6% accuracy using 46% of its parameters, and outperforms full-backbone fine-tuning methods like CDTrans and FixBi while requiring only 1.7% and 2.8% of their parameters respectively.

Conclusion: VirDA enables efficient domain adaptation by reusing backbone parameters across domains through visual reprogramming, achieving competitive performance with dramatically reduced parameter requirements compared to existing methods.

Abstract: Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters. Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its “style” to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.

[498] Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models

Wei-Lung Mao, Chun-Chi Wang, Po-Heng Chou, Yen-Ting Liu

Main category: cs.CV

TL;DR: Automated defect detection system for DIP components using digital camera optics and deep learning, with ConSinGAN for data augmentation and YOLOv7 achieving 95.50% accuracy.

DetailsMotivation: Conventional industry component defect detection is time-consuming and labor-intensive, creating burden on quality inspection personnel and making product quality management difficult.

Method: Uses digital camera optics and deep learning-based model with ConSinGAN for dataset generation. Investigates four YOLO models (v3, v4, v7, v9) with and without ConSinGAN augmentation. Develops SCADA system with sensor architecture.

Result: YOLOv7 with ConSinGAN achieves superior performance with 95.50% accuracy and 285 ms detection time, outperforming other YOLO versions and threshold-based approaches.

Conclusion: The proposed automated defect detection system can be easily established for various defect types or when defect data is insufficient, providing an efficient solution for industrial quality inspection.

Abstract: Since the defect detection of conventional industry components is time-consuming and labor-intensive, it leads to a significant burden on quality inspection personnel and makes it difficult to manage product quality. In this paper, we propose an automated defect detection system for the dual in-line package (DIP) that is widely used in industry, using digital camera optics and a deep learning (DL)-based model. The two most common defect categories of DIP are examined: (1) surface defects, and (2) pin-leg defects. However, the lack of defective component images leads to a challenge for detection tasks. To solve this problem, the ConSinGAN is used to generate a suitable-sized dataset for training and testing. Four varieties of the YOLO model are investigated (v3, v4, v7, and v9), both in isolation and with the ConSinGAN augmentation. The proposed YOLOv7 with ConSinGAN is superior to the other YOLO versions in accuracy of 95.50%, detection time of 285 ms, and is far superior to threshold-based approaches. In addition, the supervisory control and data acquisition (SCADA) system is developed, and the associated sensor architecture is described. The proposed automated defect detection can be easily established with numerous types of defects or insufficient defect data.

[499] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework

Nanaka Hosokawa, Ryo Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu, Tatsuro Hayashi, Yuta Seino, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, Hiroshi Fujita

Main category: cs.CV

TL;DR: The study developed a Self-correction Loop with Structured Output (SLSO) framework using GPT-4o to automatically generate jaw cyst findings from dental panoramic radiographs, showing improved accuracy over conventional methods.

DetailsMotivation: To improve the accuracy of automated jaw cyst findings generation from dental panoramic radiographs by addressing inconsistencies and hallucinations in AI-generated outputs.

Method: Implemented a 10-step SLSO framework including image analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies detected, and finding generation with restructuring and verification. Compared against conventional Chain-of-Thought method.

Result: SLSO improved output accuracy with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption respectively. Successful cases achieved consistent structured output after up to five regenerations. Framework enforced negative finding descriptions and suppressed hallucinations.

Conclusion: SLSO framework shows promise for improving automated finding generation but requires further refinement for extensive lesions spanning multiple teeth and overall performance enhancement for practical clinical use.

Abstract: In this study, we utilized the multimodal capabilities of OpenAI GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs. To improve accuracy, we constructed a Self-correction Loop with Structured Output (SLSO) framework and verified its effectiveness. A 10-step process was implemented for 22 cases of jaw cysts, including image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies were detected, and finding generation with subsequent restructuring and consistency verification. A comparative experiment was conducted using the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The results showed that the proposed SLSO framework improved output accuracy for many items, with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption, respectively. In the successful cases, a consistently structured output was achieved after up to five regenerations. Although statistical significance was not reached because of the small size of the dataset, the overall SLSO framework enforced negative finding descriptions, suppressed hallucinations, and improved tooth number identification accuracy. However, the accurate identification of extensive lesions spanning multiple teeth is limited. Nevertheless, further refinement is required to enhance overall performance and move toward a practical finding generation system.

[500] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL

Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, Lili Qiu

Main category: cs.CV

TL;DR: VidGuard-R1 is the first video authenticity detector that fine-tunes a multi-modal large language model using group relative policy optimization to provide both accurate detection and interpretable explanations for AI-generated videos.

DetailsMotivation: Address the urgent need for effective detection tools to mitigate societal risks from AI-generated videos, such as misinformation and reputational harm, while ensuring transparency through interpretable explanations for regulators and end users.

Method: Fine-tunes Qwen-VL using group relative policy optimization (GRPO) with two specialized reward models targeting temporal artifacts and generation complexity. Uses a curated dataset of 140k real and AI-generated videos designed to maximize discrimination difficulty.

Result: Achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Produces precise and interpretable rationales behind predictions.

Conclusion: VidGuard-R1 successfully addresses the need for both accurate detection and interpretable explanations in AI-generated video detection, demonstrating superior performance through multi-modal fine-tuning with specialized reward models.

Abstract: With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at https://VidGuard-R1.github.io.

[501] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi

Main category: cs.CV

TL;DR: Patch-ioner is a zero-shot captioning framework that shifts from image-centric to patch-centric paradigm, enabling captioning of arbitrary regions without region-level supervision by treating individual patches as atomic captioning units.

DetailsMotivation: Current zero-shot captioners only use global image representations and generate whole-image captions, limiting their ability to describe specific regions or non-contiguous areas.

Method: Treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions (single patches, non-contiguous areas, entire images) using dense visual features from backbones like DINO.

Result: Achieves state-of-the-art performance on zero-shot dense, region-set, and trace captioning tasks, outperforming other baselines and competitors.

Conclusion: Patch-wise semantic representations are effective for scalable caption generation, with dense visual features from backbones like DINO being key to performance.

Abstract: Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present Patch-ioner, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need of region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our novel proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense, region-set, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation. Project page at https://paciosoft.com/Patch-ioner/ .

[502] What Drives Compositional Generalization in Visual Generative Models?

Karim Farid, Rajat Sahay, Yumna Ali Alnaggar, Simon Schrodi, Volker Fischer, Cordelia Schmid, Thomas Brox

Main category: cs.CV

TL;DR: The paper studies how design choices affect compositional generalization in visual generative models, identifying training objective type (discrete vs continuous) and conditioning information as key factors, and proposes improving discrete models with auxiliary continuous objectives.

DetailsMotivation: To systematically understand mechanisms that enable or inhibit compositional generalization in visual generative models, as this ability to generate novel combinations of known concepts is crucial but not fully understood.

Method: Conducted controlled experiments to study design choices, then proposed relaxing MaskGIT’s discrete loss with an auxiliary continuous JEPA-based objective to improve compositional performance.

Result: Identified two key factors influencing compositional generalization: (i) whether training objective operates on discrete vs continuous distribution, and (ii) extent of conditioning information about constituent concepts during training.

Conclusion: Auxiliary continuous objectives can improve compositional performance in discrete models like MaskGIT, bridging the gap between discrete and continuous approaches for better compositional generalization.

Abstract: Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, not all mechanisms that enable or inhibit it are fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation in a positive or negative way. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the MaskGIT discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.

cs.AI

[503] WAREX: Web Agent Reliability Evaluation on Existing Benchmarks

Su Kara, Fazle Faisal, Suman Nath

Main category: cs.AI

TL;DR: WAREX evaluates web agent reliability by introducing real-world network instability and security threats to existing benchmarks, revealing significant performance drops in state-of-the-art agents.

DetailsMotivation: Current benchmarks test agents in controlled environments, but real-world web interactions face network instability, HTTPS issues, and security threats like XSS attacks that aren't captured in existing evaluations.

Method: WAREX extends three popular benchmarks (WebArena, WebVoyager, REAL) by introducing real-world network conditions, HTTPS instability, and web security threats including Cross-Site Scripting attacks and malicious pop-ups.

Result: Experiments show significant drops in task success rates when WAREX conditions are introduced, demonstrating limited robustness of current state-of-the-art web agents.

Conclusion: Existing web agent benchmarks fail to capture real-world reliability challenges, and current agents lack robustness against network instability and security threats that occur in practical web environments.

Abstract: Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. However, in the real world, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side, server-side issues or broader system failures. Moreover, live websites are prone to web attacks such Cross-Site Scripting, as well as general site modifications which can cause unexpected or malicious pop-ups or improper functionality. To address this gap, we present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents.

[504] Refined Iterated Pareto Greedy for Energy-aware Hybrid Flowshop Scheduling with Blocking Constraints

Ahmed Missaoui, Cemalettin Ozturk, Barry O’Sullivan

Main category: cs.AI

TL;DR: This paper addresses energy-efficient scheduling in manufacturing by solving a hybrid flow shop problem with blocking constraints, aiming to minimize both makespan and energy consumption using multi-objective optimization methods.

DetailsMotivation: The scarcity of non-renewable energy sources, geopolitical supply issues, rising prices, and climate change pressures drive the need for energy-efficient manufacturing solutions, particularly in scheduling which can be quickly deployed with immediate impact.

Method: The authors formulated a multi-objective mixed integer programming model and proposed two approaches: an augmented epsilon-constraint method for finding Pareto-optimal solutions, and a Refined Iterated Pareto Greedy (RIPG) metaheuristic algorithm for solving large instances efficiently.

Result: The proposed methods were benchmarked across small, medium, and large-size instances and compared against two well-known algorithms, with computational results demonstrating the effectiveness of the approach.

Conclusion: The developed multi-objective optimization methods effectively address the conflicting objectives of minimizing makespan and energy consumption in hybrid flow shop scheduling with blocking constraints, providing practical solutions for energy-efficient manufacturing operations.

Abstract: The scarcity of non-renewable energy sources, geopolitical problems in its supply, increasing prices, and the impact of climate change, force the global economy to develop more energy-efficient solutions for their operations. The Manufacturing sector is not excluded from this challenge as one of the largest consumers of energy. Energy-efficient scheduling is a method that attracts manufacturing companies to reduce their consumption as it can be quickly deployed and can show impact immediately. In this study, the hybrid flow shop scheduling problem with blocking constraint (BHFS) is investigated in which we seek to minimize the latest completion time (i.e. makespan) and overall energy consumption, a typical manufacturing setting across many industries from automotive to pharmaceutical. Energy consumption and the latest completion time of customer orders are usually conflicting objectives. Therefore, we first formulate the problem as a novel multi-objective mixed integer programming (MIP) model and propose an augmented epsilon-constraint method for finding the Pareto-optimal solutions. Also, an effective multi-objective metaheuristic algorithm. Refined Iterated Pareto Greedy (RIPG), is developed to solve large instances in reasonable time. Our proposed methods are benchmarked using small, medium, and large-size instances to evaluate their efficiency. Two well-known algorithms are adopted for comparing our novel approaches. The computational results show the effectiveness of our method.

[505] Know Thyself? On the Incapability and Implications of AI Self-Recognition

Xiaoyan Bai, Aryan Shrivastava, Ari Holtzman, Chenhao Tan

Main category: cs.AI

TL;DR: This paper evaluates self-recognition in 10 contemporary LLMs, finding consistent failure in identifying their own generated text, with performance rarely above random chance and strong bias toward predicting GPT and Claude families.

DetailsMotivation: To resolve contradictory interpretations about whether models possess self-recognition capabilities, which is crucial for AI safety and metacognitive analysis.

Method: Systematic evaluation framework with two tasks: binary self-recognition (identifying own vs. other model’s text) and exact model prediction, applied to 10 contemporary LLMs.

Result: Only 4 out of 10 models predicted themselves as generators, performance rarely above random chance, strong bias toward GPT and Claude families. Models show some knowledge of own existence but reasoning reveals hierarchical bias associating high-quality text with top-tier models.

Conclusion: Findings highlight limitations in current LLM self-recognition capabilities, with implications for AI safety and need to develop appropriate AI self-awareness.

Abstract: Self-recognition is a crucial metacognitive capability for AI systems, relevant not only for psychological analysis but also for safety, particularly in evaluative scenarios. Motivated by contradictory interpretations of whether models possess self-recognition (Panickssery et al., 2024; Davidson et al., 2024), we introduce a systematic evaluation framework that can be easily applied and updated. Specifically, we measure how well 10 contemporary larger language models (LLMs) can identify their own generated text versus text from other models through two tasks: binary self-recognition and exact model prediction. Different from prior claims, our results reveal a consistent failure in self-recognition. Only 4 out of 10 models predict themselves as generators, and the performance is rarely above random chance. Additionally, models exhibit a strong bias toward predicting GPT and Claude families. We also provide the first evaluation of model awareness of their own and others’ existence, as well as the reasoning behind their choices in self-recognition. We find that the model demonstrates some knowledge of its own existence and other models, but their reasoning reveals a hierarchical bias. They appear to assume that GPT, Claude, and occasionally Gemini are the top-tier models, often associating high-quality text with them. We conclude by discussing the implications of our findings on AI safety and future directions to develop appropriate AI self-awareness.

[506] ContraGen: A Multi-Agent Generation Framework for Enterprise Contradictions Detection

Ananya Mantravadi, Shivali Dalmia, Abhishek Mukherji, Nand Dave, Anudha Mittal

Main category: cs.AI

TL;DR: ContraGen is a contradiction-aware benchmark framework for enterprise RAG systems that generates synthetic enterprise documents with embedded contradictions to evaluate consistency.

DetailsMotivation: Existing benchmarks for contradiction detection are limited to sentence-level analysis and don't capture the complexity of enterprise documents like contracts, financial filings, and compliance reports, which is problematic for enterprise RAG systems where compliance and accountability are critical.

Method: The framework generates synthetic enterprise-style documents with embedded contradictions, combines automated contradiction mining with human-in-the-loop validation, models a taxonomy of contradiction types common in business processes, and enables controlled creation of self- and pairwise contradictions.

Result: The work establishes a foundation for more trustworthy and accountable RAG systems in enterprise applications by developing a contradiction-aware retrieval evaluation pipeline with human oversight to reflect domain-specific judgment complexity.

Conclusion: ContraGen addresses the limitation of existing benchmarks and provides a systematic approach for evaluating both intra-document and cross-document consistency, which is essential for reducing risk and ensuring compliance in enterprise information-seeking applications.

Abstract: Retrieval-Augmented Generation (RAG) integrates LLMs with external sources, offering advanced capabilities for information access and decision-making. However, contradictions in retrieved evidence can result in inconsistent or untrustworthy outputs, which is especially problematic in enterprise settings where compliance, governance, and accountability are critical. Existing benchmarks for contradiction detection are limited to sentence-level analysis and do not capture the complexity of enterprise documents such as contracts, financial filings, compliance reports, or policy manuals. To address this limitation, we propose ContraGen, a contradiction-aware benchmark framework tailored to enterprise domain. The framework generates synthetic enterprise-style documents with embedded contradictions, enabling systematic evaluation of both intra-document and cross-document consistency. Automated contradiction mining is combined with human-in-the-loop validation to ensure high accuracy. Our contributions include generating realistic enterprise documents, modeling a taxonomy of contradiction types common in business processes, enabling controlled creation of self- and pairwise contradictions, developing a contradiction-aware retrieval evaluation pipeline and embedding human oversight to reflect domain-specific judgment complexity. This work establishes a foundation for more trustworthy and accountable RAG systems in enterprise information-seeking applications, where detecting and resolving contradictions is essential for reducing risk and ensuring compliance.

[507] A Qualitative Comparative Evaluation of Cognitive and Generative Theories

Paul S. Rosenbloom

Main category: cs.AI

TL;DR: This paper presents a qualitative comparison of cognitive and generative neural architectures for whole-mind systems, addressing the challenges of theory evaluation in both approaches.

DetailsMotivation: Evaluation is challenging for both cognitive architectures and generative neural architectures, creating a dual challenge that needs to be addressed through systematic comparison.

Method: The study uses a broad perspective on theory evaluation to conduct a wide-ranging qualitative comparison of whole-mind-oriented cognitive and generative architectures and their full systems.

Result: The paper yields a comprehensive comparison framework for evaluating theories based on different architectural approaches to whole-mind modeling.

Conclusion: A broad perspective on theory evaluation enables meaningful qualitative comparison between cognitive and generative architectures for whole-mind systems, addressing the evaluation challenges in both approaches.

Abstract: Evaluation is a critical activity associated with any theory. Yet this has proven to be an exceptionally challenging activity for theories based on cognitive architectures. For an overlapping set of reasons, evaluation can also be challenging for theories based on generative neural architectures. This dual challenge is approached here by leveraging a broad perspective on theory evaluation to yield a wide-ranging, albeit qualitative, comparison of whole-mind-oriented cognitive and generative architectures and the full systems that are based on these architectures.

[508] Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

Keshav Ramani, Vali Tawosi, Salwa Alamir, Daniel Borrajo

Main category: cs.AI

TL;DR: A framework that uses LLMs to convert natural language plans into Kripke structures and LTL formulas, then performs model checking to evaluate plan alignment with expected behavior.

DetailsMotivation: To systematically evaluate the alignment between natural language plans and their expected behavior through formal verification methods.

Method: Convert natural language plans into Kripke structures and Linear Temporal Logic (LTL) formulas using Large Language Models, then perform model checking on a simplified PlanBench dataset.

Result: GPT-5 achieves excellent classification performance with 96.3% F1 score and produces syntactically perfect formal representations that can serve as guarantees.

Conclusion: The framework successfully demonstrates high performance in plan verification, though achieving semantically perfect formal models requires further exploration.

Abstract: We introduce a novel framework for evaluating the alignment between natural language plans and their expected behavior by converting them into Kripke structures and Linear Temporal Logic (LTL) using Large Language Models (LLMs) and performing model checking. We systematically evaluate this framework on a simplified version of the PlanBench plan verification dataset and report on metrics like Accuracy, Precision, Recall and F1 scores. Our experiments demonstrate that GPT-5 achieves excellent classification performance (F1 score of 96.3%) while almost always producing syntactically perfect formal representations that can act as guarantees. However, the synthesis of semantically perfect formal models remains an area for future exploration.

[509] Towards Policy-Compliant Agents: Learning Efficient Guardrails For Policy Violation Detection

Xiaofei Wen, Wenjie Jacky Mo, Yanan Xie, Peng Qi, Muhao Chen

Main category: cs.AI

TL;DR: PolicyGuardBench is a benchmark for detecting policy violations in web agent trajectories, with PolicyGuard-4B as a lightweight guardrail model that achieves strong detection accuracy and generalization across domains.

DetailsMotivation: Autonomous web agents need to operate under external policies, but little work has examined whether their trajectories comply with such policies across different contexts like domains and subdomains.

Method: Created PolicyGuardBench with 60k examples from diverse agent runs, generating policies and violation labels for within and cross subdomain pairings. Includes full-trajectory evaluation and prefix-based violation detection. Trained PolicyGuard-4B model on this dataset.

Result: PolicyGuard-4B delivers strong detection accuracy across all tasks while keeping inference efficient, generalizes across domains, and preserves high accuracy on unseen settings.

Conclusion: The framework shows that accurate and generalizable guardrails for policy compliance in web agent trajectories are feasible at small scales.

Abstract: Autonomous web agents need to operate under externally imposed or human-specified policies while generating long-horizon trajectories. However, little work has examined whether these trajectories comply with such policies, or whether policy violations persist across different contexts such as domains (e.g., shopping or coding websites) and subdomains (e.g., product search and order management in shopping). To address this gap, we introduce PolicyGuardBench, a benchmark of about 60k examples for detecting policy violations in agent trajectories. From diverse agent runs, we generate a broad set of policies and create both within subdomain and cross subdomain pairings with violation labels. In addition to full-trajectory evaluation, PolicyGuardBench also includes a prefix-based violation detection task where models must anticipate policy violations from truncated trajectory prefixes rather than complete sequences. Using this dataset, we train PolicyGuard-4B, a lightweight guardrail model that delivers strong detection accuracy across all tasks while keeping inference efficient. Notably, PolicyGuard-4B generalizes across domains and preserves high accuracy on unseen settings. Together, PolicyGuardBench and PolicyGuard-4B provide the first comprehensive framework for studying policy compliance in web agent trajectories, and show that accurate and generalizable guardrails are feasible at small scales.

[510] OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows

John Nguyen, Marton Havasi, Tariq Berrada, Luke Zettlemoyer, Ricky T. Q. Chen

Main category: cs.AI

TL;DR: OneFlow is the first non-autoregressive multimodal model that enables concurrent text-image generation using insertion-based Edit Flow for text and Flow Matching for images, outperforming autoregressive and diffusion models with fewer training FLOPs.

DetailsMotivation: To overcome the limitations of autoregressive models that enforce rigid causal ordering between text and image generation, enabling more flexible and efficient multimodal generation.

Method: Combines insertion-based Edit Flow for discrete text tokens with Flow Matching for image latents, using hierarchical sampling that prioritizes content over grammar for concurrent text-image synthesis.

Result: Outperforms autoregressive baselines on both generation and understanding tasks while using up to 50% fewer training FLOPs, and surpasses both autoregressive and diffusion-based approaches.

Conclusion: OneFlow unlocks new capabilities for concurrent generation, iterative refinement, and natural reasoning-like generation, demonstrating superior performance and efficiency compared to existing approaches.

Abstract: We present OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. Unlike autoregressive models that enforce rigid causal ordering between text and image generation, OneFlow combines an insertion-based Edit Flow for discrete text tokens with Flow Matching for image latents. OneFlow enables concurrent text-image synthesis with hierarchical sampling that prioritizes content over grammar. Through controlled experiments across model sizes from 1B to 8B, we demonstrate that OneFlow outperforms autoregressive baselines on both generation and understanding tasks while using up to 50% fewer training FLOPs. OneFlow surpasses both autoregressive and diffusion-based approaches while unlocking new capabilities for concurrent generation, iterative refinement, and natural reasoning-like generation.

[511] Understanding the Role of Training Data in Test-Time Scaling

Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni

Main category: cs.AI

TL;DR: Test-time scaling improves LLM reasoning by allowing longer Chains-of-Thought, but its effectiveness depends on training data conditions. The paper provides theoretical analysis showing test-time compute can reduce required context length, but may harm performance if skills are missing from training data.

DetailsMotivation: To understand when and why long Chains-of-Thought improve reasoning performance, and the conditions in training data that enable effective test-time scaling, as demonstrated by models like OpenAI's o1 and DeepSeek R1.

Method: Theoretical analysis of transformers trained on in-context weight prediction for linear regression, characterizing task hardness via smallest eigenvalue of feature covariance matrix, and experimental validation on nonlinear transformers.

Result: Test-time compute allows reducing context length for same error; can harm performance if required skills are absent from training; diverse, relevant, hard tasks yield best test-time scaling performance.

Conclusion: Test-time scaling effectiveness depends critically on training data quality - diverse, relevant, and challenging tasks enable models to leverage longer reasoning chains effectively during inference.

Abstract: Test-time scaling improves the reasoning capabilities of large language models (LLMs) by allocating extra compute to generate longer Chains-of-Thoughts (CoTs). This enables models to tackle more complex problem by breaking them down into additional steps, backtracking, and correcting mistakes. Despite its strong performance–demonstrated by OpenAI’s o1 and DeepSeek R1, the conditions in the training data under which long CoTs emerge, and when such long CoTs improve the performance, remain unclear. In this paper, we study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. Our analysis provides a theoretical explanation for several intriguing observations: First, at any fixed test error, increasing test-time compute allows us to reduce the number of in-context examples (context length) in training prompts. Second, if the skills required to solve a downstream task are not sufficiently present in the training data, increasing test-time compute can harm performance. Finally, we characterize task hardness via the smallest eigenvalue of its feature covariance matrix and show that training on a diverse, relevant, and hard set of tasks results in best performance for test-time scaling. We confirm our findings with experiments on large, nonlinear transformer architectures.

[512] Cross-Modal Content Optimization for Steering Web Agent Preferences

Tanqiu Jiang, Min Bai, Nikolaos Pappas, Yanjun Qi, Sandesh Swamy

Main category: cs.AI

TL;DR: Cross-Modal Preference Steering (CPS) is a black-box attack method that jointly manipulates visual and textual content to bias VLM-based web agents’ selection decisions, outperforming existing methods while maintaining stealth.

DetailsMotivation: VLM-based web agents are vulnerable to preference manipulation attacks, but existing methods have unrealistic assumptions like white-box access or single-modal perturbations. There's a need for more realistic and effective attack methods.

Method: CPS jointly optimizes imperceptible modifications to both visual and textual content using CLIP-transferable image perturbations and RLHF-induced linguistic biases, operating in a realistic black-box setting where attackers can only edit their own listing’s images and text metadata.

Result: CPS significantly outperforms baseline methods across state-of-the-art VLMs (GPT-4.1, Qwen-2.5VL, Pixtral-Large) on movie selection and e-commerce tasks, while maintaining 70% lower detection rates.

Conclusion: The effectiveness of CPS highlights an urgent need for robust defenses as VLM-based agents play increasingly important roles in high-stakes selection tasks like content recommendation and product ranking.

Abstract: Vision-language model (VLM)-based web agents increasingly power high-stakes selection tasks like content recommendation or product ranking by combining multimodal perception with preference reasoning. Recent studies reveal that these agents are vulnerable against attackers who can bias selection outcomes through preference manipulations using adversarial pop-ups, image perturbations, or content tweaks. Existing work, however, either assumes strong white-box access, with limited single-modal perturbations, or uses impractical settings. In this paper, we demonstrate, for the first time, that joint exploitation of visual and textual channels yields significantly more powerful preference manipulations under realistic attacker capabilities. We introduce Cross-Modal Preference Steering (CPS) that jointly optimizes imperceptible modifications to an item’s visual and natural language descriptions, exploiting CLIP-transferable image perturbations and RLHF-induced linguistic biases to steer agent decisions. In contrast to prior studies that assume gradient access, or control over webpages, or agent memory, we adopt a realistic black-box threat setup: a non-privileged adversary can edit only their own listing’s images and textual metadata, with no insight into the agent’s model internals. We evaluate CPS on agents powered by state-of-the-art proprietary and open source VLMs including GPT-4.1, Qwen-2.5VL and Pixtral-Large on both movie selection and e-commerce tasks. Our results show that CPS is significantly more effective than leading baseline methods. For instance, our results show that CPS consistently outperforms baselines across all models while maintaining 70% lower detection rates, demonstrating both effectiveness and stealth. These findings highlight an urgent need for robust defenses as agentic systems play an increasingly consequential role in society.

[513] MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information

Jiaxi Li, Yucheng Shi, Jin Lu, Ninghao Liu

Main category: cs.AI

TL;DR: MITS is a novel tree search framework for LLM reasoning that uses pointwise mutual information for step-wise path evaluation and beam search, achieving superior performance with computational efficiency.

DetailsMotivation: Existing tree search methods for LLM reasoning face challenges in providing instant quantitative assessments of intermediate reasoning steps and suffer from computational costs due to extensive path exploration.

Method: Proposes Mutual Information Tree Search (MITS) with PMI-based scoring function for step-wise evaluation, beam search for tree expansion without look-ahead simulations, entropy-based dynamic sampling for resource allocation, and weighted voting for final prediction.

Result: MITS consistently surpasses baseline methods across diverse reasoning benchmarks while maintaining computational efficiency.

Conclusion: MITS establishes a principled and efficient framework for LLM reasoning through information-theoretic guidance.

Abstract: Tree search has become as a representative framework for test-time reasoning with large language models (LLMs), exemplified by methods such as Tree-of-Thought and Monte Carlo Tree Search that explore multiple reasoning paths. However, it remains difficult to provide instant and reliable quantitative assessments of intermediate reasoning step quality, and extensive path exploration is computationally costly. To address this, we propose Mutual Information Tree Search (MITS), a novel framework that guides reasoning with information-theoretic principles. MITS introduces an effective scoring function based on pointwise mutual information (PMI), which enables step-wise evaluation of reasoning paths and search tree expansion via beam search without expensive look-ahead simulations, achieving superior reasoning performances while maintaining computational efficiency. The framework is complemented by an entropy-based dynamic sampling strategy that adaptively allocates computational resources to uncertain reasoning steps where exploration is most beneficial. For final prediction, MITS employs a weighted voting scheme that combines PMI scores with prediction consensus. Through comprehensive experiments on diverse reasoning benchmarks, MITS consistently surpasses baseline methods, establishing a principled and efficient framework for LLM reasoning.

[514] Rainbow Padding: Mitigating Early Termination in Instruction-Tuned Diffusion LLMs

Bumjun Kim, Dongjae Jeon, Dueun Kim, Wonje Jeung, Albert No

Main category: cs.AI

TL;DR: Diffusion LLMs suffer from \texttt{} overflow where longer sequences cause shorter responses due to \texttt{} token dominance. Rainbow Padding fixes this by using multiple distinct padding tokens instead of repeated \texttt{} tokens, improving length robustness with minimal fine-tuning.

DetailsMotivation: Instruction-tuned diffusion LLMs exhibit \texttt{} overflow vulnerability - responses paradoxically shorten as allocated sequence length increases, collapsing into early termination or degenerating into \texttt{} token streams.

Method: Rainbow Padding replaces repeated \texttt{} placeholders with a repeating cycle of distinct padding tokens to distribute probability mass and break \texttt{} dominance. Requires only LoRA fine-tuning for a single epoch on minimal data.

Result: Rainbow Padding substantially improves length robustness and output quality. As few as seven padding tokens are sufficient to prevent early termination. The method integrates efficiently into existing instruction-tuned models with significant improvements.

Conclusion: Rainbow Padding provides a simple, practical solution to \texttt{} overflow in diffusion LLMs by addressing the dual role of \texttt{} as termination and padding, enabling robust generation across sequence lengths.

Abstract: Diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive models, offering flexible generation orders and strong performance on complex reasoning tasks. However, instruction-tuned dLLMs exhibit a critical vulnerability we term \texttt{} overflow: as allocated sequence length increases, responses paradoxically become shorter, collapsing into early termination or degenerating into streams of \texttt{} tokens. Although noticed in practice, this issue has not been systematically analyzed. We trace its root cause to the dual role of \texttt{} as both termination and padding, which concentrates probability mass on \texttt{} at later positions and propagates backward to trigger early termination. To address this, we introduce Rainbow Padding, a simple remedy that replaces repeated \texttt{} placeholders with a repeating cycle of distinct padding tokens, distributing probability mass and breaking \texttt{} dominance. Experiments show that Rainbow Padding substantially improves length robustness and output quality, with as few as seven padding tokens sufficient to prevent early termination. Moreover, the method integrates efficiently into existing instruction-tuned models: LoRA fine-tuning for a single epoch on minimal data yields significant improvements, making this solution highly practical. The code is publicly available at https://github.com/quasar529/rainbow-padding.

[515] Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models

Deepak Babu Piskala, Sharlene Chen, Udita Patel, Parul Kalra, Rafael Castrillo

Main category: cs.AI

TL;DR: Proposes a goal-oriented evaluation framework for multi-agent chatbots using Goal Success Rate (GSR) and Root Cause of Failure (RCOF) taxonomy, with model-based evaluation using teacher LLMs for explainable assessments.

DetailsMotivation: Existing methods evaluate chatbot interactions at turn level without assessing whether users' overarching goals (information needs or tasks) are fulfilled, making comprehensive evaluation challenging.

Method: Segments conversations by user goals, evaluates success using all relevant turns, employs teacher LLMs with thinking tokens for interpretable rationales, and uses domain experts to define goals and quality standards.

Result: Applied to AIDA enterprise chatbot system, achieved GSR improvement from 63% to 79% over six months, providing actionable insights through detailed failure analysis.

Conclusion: The framework offers generic, explainable, and data-efficient evaluation of multi-agent systems, enabling diagnosis of success rates, identification of failure modes, and informing system improvements.

Abstract: Evaluating the quality of multi-turn chatbot interactions remains challenging, as most existing methods assess interactions at the turn level without addressing whether a user’s overarching goal was fulfilled. A goal'' here refers to an information need or task, such as asking for policy information or applying for leave. We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems (MAS), introducing the \textbf{Goal Success Rate (GSR)} to measure the percentage of fulfilled goals, and a \textbf{Root Cause of Failure (RCOF)} taxonomy to identify reasons for failure in multi-agent chatbots. Our method segments conversations by user goals and evaluates success using all relevant turns. We present a model-based evaluation system combining teacher LLMs, where domain experts define goals, set quality standards serving as a guidance for the LLMs. The LLMs use thinking tokens’’ to produce interpretable rationales, enabling \textit{explainable}, \textit{data-efficient} evaluations. In an enterprise setting, we apply our framework to evaluate AIDA, a zero-to-one employee conversational agent system built as a ground-up multi-agent conversational agent, and observe GSR improvement from 63% to 79% over six months since its inception. Our framework is generic and offers actionable insights through a detailed defect taxonomy based on analysis of failure points in multi-agent chatbots, diagnosing overall success, identifying key failure modes, and informing system improvements.

[516] H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis

Seungseop Lim, Gibaeg Kim, Hyunkyung Lee, Wooseok Han, Jean Seo, Jaehyo Yoo, Eunho Yang

Main category: cs.AI

TL;DR: H-DDx is a hierarchical evaluation framework for LLMs in differential diagnosis that better captures clinical relevance by crediting near-miss predictions, showing conventional flat metrics underestimate performance.

DetailsMotivation: Existing evaluations of LLMs for differential diagnosis use flat metrics like Top-k accuracy that fail to distinguish clinically relevant near-misses from diagnostically distant errors, limiting clinical utility.

Method: H-DDx uses a retrieval and reranking pipeline to map free-text diagnoses to ICD-10 codes and applies hierarchical metrics that credit predictions closely related to ground-truth diagnoses.

Result: Benchmarking 22 leading models showed conventional flat metrics underestimate performance by overlooking clinically meaningful outputs, with domain-specialized open-source models performing well.

Conclusion: The hierarchical framework enhances interpretability by revealing that LLMs often correctly identify broader clinical context even when missing precise diagnoses, providing more clinically relevant evaluation.

Abstract: An accurate differential diagnosis (DDx) is essential for patient care, shaping therapeutic decisions and influencing outcomes. Recently, Large Language Models (LLMs) have emerged as promising tools to support this process by generating a DDx list from patient narratives. However, existing evaluations of LLMs in this domain primarily rely on flat metrics, such as Top-k accuracy, which fail to distinguish between clinically relevant near-misses and diagnostically distant errors. To mitigate this limitation, we introduce H-DDx, a hierarchical evaluation framework that better reflects clinical relevance. H-DDx leverages a retrieval and reranking pipeline to map free-text diagnoses to ICD-10 codes and applies a hierarchical metric that credits predictions closely related to the ground-truth diagnosis. In benchmarking 22 leading models, we show that conventional flat metrics underestimate performance by overlooking clinically meaningful outputs, with our results highlighting the strengths of domain-specialized open-source models. Furthermore, our framework enhances interpretability by revealing hierarchical error patterns, demonstrating that LLMs often correctly identify the broader clinical context even when the precise diagnosis is missed.

[517] Bridging the Gap Between Multimodal Foundation Models and World Models

Xuehai He

Main category: cs.AI

TL;DR: This paper investigates how to bridge multimodal foundation models (MFMs) with world models by enhancing their reasoning and generative capabilities for better understanding and simulating dynamic physical processes.

DetailsMotivation: Current MFMs lack essential world modeling abilities like counterfactual reasoning, dynamics simulation, spatiotemporal understanding, outcome control, and multifaceted reasoning, despite being powerful for multimodal understanding and generation.

Method: Improve MFMs’ reasoning through discriminative tasks and structured reasoning skills (causal inference, counterfactual thinking, spatiotemporal reasoning), and enhance generative capabilities using scene graphs, multimodal conditioning, alignment strategies, and controllable 4D generation techniques.

Result: The approaches enable MFMs to go beyond surface correlations, understand deeper relationships in visual/textual data, and achieve structured, controllable generation across image and video modalities with temporal and spatial consistency.

Conclusion: By integrating enhanced reasoning and generative capabilities, MFMs can be transformed into effective world models that better simulate and understand dynamic physical processes through multimodal integration.

Abstract: Humans understand the world through the integration of multiple sensory modalities, enabling them to perceive, reason about, and imagine dynamic physical processes. Inspired by this capability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today’s MFMs fall short of serving as effective world models. They lack the essential ability such as perform counterfactual reasoning, simulate dynamics, understand the spatiotemporal information, control generated visual outcomes, and perform multifaceted reasoning. We investigates what it takes to bridge the gap between multimodal foundation models and world models. We begin by improving the reasoning capabilities of MFMs through discriminative tasks and equipping MFMs with structured reasoning skills, such as causal inference, counterfactual thinking, and spatiotemporal reasoning, enabling them to go beyond surface correlations and understand deeper relationships within visual and textual data. Next, we explore generative capabilities of multimodal foundation models across both image and video modalities, introducing new frameworks for structured and controllable generation. Our approaches incorporate scene graphs, multimodal conditioning, and multimodal alignment strategies to guide the generation process, ensuring consistency with high-level semantics and fine-grained user intent. We further extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.

[518] OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation

Divij Handa, David Blincoe, Orson Adams, Yinlin Fu

Main category: cs.AI

TL;DR: OptAgent is a framework that uses multi-agent simulations with genetic algorithms to optimize e-commerce query rewriting, achieving 21.98% improvement over original queries and 3.36% over baseline LLM rewriting.

DetailsMotivation: LLM evaluation is challenging for subjective tasks like e-commerce query rewriting where determining proper user intent capture is difficult algorithmically, unlike verifiable tasks with gold-standard solutions.

Method: Combines multi-agent simulations with genetic algorithms, using multiple LLM-based agents acting as simulated shopping customers as dynamic reward signals, with average agent scores serving as fitness function for evolutionary query refinement.

Result: Evaluated on 1000 real-world e-commerce queries across 5 categories, achieving 21.98% average improvement over original user queries and 3.36% improvement over Best-of-N LLM rewriting baseline.

Conclusion: OptAgent provides an effective framework for optimizing subjective tasks like query rewriting by leveraging multi-agent simulations and evolutionary algorithms, overcoming limitations of static reward models or single LLM judges.

Abstract: Deploying capable and user-aligned LLM-based systems necessitates reliable evaluation. While LLMs excel in verifiable tasks like coding and mathematics, where gold-standard solutions are available, adoption remains challenging for subjective tasks that lack a single correct answer. E-commerce Query Rewriting (QR) is one such problem where determining whether a rewritten query properly captures the user intent is extremely difficult to figure out algorithmically. In this work, we introduce OptAgent, a novel framework that combines multi-agent simulations with genetic algorithms to verify and optimize queries for QR. Instead of relying on a static reward model or a single LLM judge, our approach uses multiple LLM-based agents, each acting as a simulated shopping customer, as a dynamic reward signal. The average of these agent-derived scores serves as an effective fitness function for an evolutionary algorithm that iteratively refines the user’s initial query. We evaluate OptAgent on a dataset of 1000 real-world e-commerce queries in five different categories, and we observe an average improvement of 21.98% over the original user query and 3.36% over a Best-of-N LLM rewriting baseline.

[519] GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time

Divij Handa, Mihir Parmar, Aswin RRV, Md Nayem Uddin, Hamid Palangi, Chitta Baral

Main category: cs.AI

TL;DR: GuidedSampling is a new inference algorithm that improves diversity in solution generation by separating exploration and generation phases, outperforming Repeated Sampling (RS) by ~21.6% at pass@50 and training models that achieve ~9.7% better performance at pass@5.

DetailsMotivation: Repeated Sampling (RS) struggles with generating diverse solution candidates, often producing redundant samples by relying on the same underlying approach, which limits its effectiveness despite being a simple inference-time algorithm.

Method: GuidedSampling decouples exploration and generation phases. The exploration phase identifies multiple concepts for solving the problem, while the generation phase applies specific concepts to produce final solution candidates.

Result: GuidedSampling improves base model performance at pass@50 by ~21.6% across various benchmarks compared to RS. Models trained on GuidedSampling trajectories show ~9.7% improvement at pass@5 and increase average concepts per instance from 1.67 to 3.03, generating more diverse candidates.

Conclusion: GuidedSampling effectively addresses the diversity limitations of Repeated Sampling by separating concept exploration from solution generation, leading to significant performance improvements and more diverse solution candidates in both inference and training scenarios.

Abstract: Repeated Sampling (RS) is a simple inference-time algorithm that has been shown to improve model performance on complex tasks. Although it is an effective way of scaling inference time, it often struggles to generate diverse solution candidates, frequently relying on the same underlying approach to solve the problem and thus producing redundant samples. To address this limitation, we propose a new inference algorithm, GuidedSampling, which decouples the exploration and generation phases during inference, increasing diversity of generated candidate solutions. The exploration phase identifies multiple concepts that can be utilized to solve the problem, while the generation phase applies a specific concept to provide final solution candidates. We first define the theoretical bounds of GuidedSampling and then empirically demonstrate that it improves the performance of base model at pass@50 by on an average ~21.6% across various benchmarks compared to RS. Furthermore, models trained on trajectories of GuidedSampling exhibit substantial performance improvements at pass@5 by on an average ~9.7%, compared to models trained on traditional RS. Additionally, models trained with GuidedSampling increases the average number of concepts per instance (1.67 -> 3.03), yielding a diverse set of candidates than traditional RS.

[520] Speculative Actions: A Lossless Framework for Faster Agentic Systems

Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, Tianyi Peng

Main category: cs.AI

TL;DR: Speculative actions framework enables parallel execution of agent actions using faster prediction models, reducing latency in agentic systems without loss of accuracy.

DetailsMotivation: AI agent execution is often slow due to sequential API calls, hampering training, evaluation, and deployment. For example, chess games between state-of-the-art agents can take hours.

Method: Proposes speculative actions framework inspired by speculative execution in microprocessors and speculative decoding in LLM inference. Uses faster models to predict likely actions, enabling parallel execution of multiple steps.

Result: Achieves up to 55% accuracy in next-action prediction across gaming, e-commerce, and web search environments. Significant reductions in end-to-end latency. Performance improves with stronger guessing models, top-K action prediction, multi-step speculation, and uncertainty-aware optimization.

Conclusion: Speculative actions provide a promising path toward deploying low-latency agentic systems in real-world applications, with lossless execution and substantial speed improvements.

Abstract: Despite growing interest in AI agents across industry and academia, their execution in an environment is often slow, hampering training, evaluation, and deployment. For example, a game of chess between two state-of-the-art agents may take hours. A critical bottleneck is that agent behavior unfolds sequentially: each action requires an API call, and these calls can be time-consuming. Inspired by speculative execution in microprocessors and speculative decoding in LLM inference, we propose speculative actions, a lossless framework for general agentic systems that predicts likely actions using faster models, enabling multiple steps to be executed in parallel. We evaluate this framework across three agentic environments: gaming, e-commerce, web search, and a “lossy” extension for an operating systems environment. In all cases, speculative actions achieve substantial accuracy in next-action prediction (up to 55%), translating into significant reductions in end-to-end latency. Moreover, performance can be further improved through stronger guessing models, top-K action prediction, multi-step speculation, and uncertainty-aware optimization, opening a promising path toward deploying low-latency agentic systems in the real world.

[521] The Hidden Game Problem

Gon Buzaglo, Noah Golowich, Elad Hazan

Main category: cs.AI

TL;DR: The paper introduces the hidden game problem where players have unknown subsets of strategies that yield higher rewards, and develops efficient regret minimization algorithms to discover and exploit these hidden structures while achieving optimal regret bounds.

DetailsMotivation: The research is motivated by challenges in AI alignment and language games, where players need to discover hidden strategy structures that consistently yield higher rewards.

Method: The authors develop a composition of regret minimization techniques that achieve optimal external and swap regret bounds, enabling discovery and exploitation of hidden game structures.

Result: The approach ensures rapid convergence to correlated equilibria in hidden subgames while maintaining computational efficiency through leveraging the hidden game structure.

Conclusion: The paper affirmatively answers that efficient regret minimization algorithms can be designed to discover and exploit hidden game structures, leading to equilibrium in subgames while maintaining general rationality.

Abstract: This paper investigates a class of games with large strategy spaces, motivated by challenges in AI alignment and language games. We introduce the hidden game problem, where for each player, an unknown subset of strategies consistently yields higher rewards compared to the rest. The central question is whether efficient regret minimization algorithms can be designed to discover and exploit such hidden structures, leading to equilibrium in these subgames while maintaining rationality in general. We answer this question affirmatively by developing a composition of regret minimization techniques that achieve optimal external and swap regret bounds. Our approach ensures rapid convergence to correlated equilibria in hidden subgames, leveraging the hidden game structure for improved computational efficiency.

[522] Small Language Models for Agentic Systems: A Survey of Architectures, Capabilities, and Deployment Trade offs

Raghav Sharma, Manan Mehta

Main category: cs.AI

TL;DR: Small language models (1-20B parameters) are often superior to larger models for agentic workloads requiring schema-constrained accuracy, offering 10x-100x lower costs with better latency and energy efficiency when combined with guided decoding and validation techniques.

DetailsMotivation: To demonstrate that small language models are sufficient and often better than large models for agentic workloads where the focus is on schema- and API-constrained accuracy rather than open-ended generation, enabling cost-effective and efficient agent systems.

Method: Synthesizes evidence from various SLMs and uses guided decoding libraries (XGrammar, Outlines) with strict JSON Schema outputs, validator-first tool execution, uncertainty-aware routing, and verifier cascades in SLM-default, LLM-fallback systems.

Result: SLMs can match or surpass LLMs on tool use, function calling, and RAG tasks with 10x-100x lower token costs, better latency, and energy efficiency, while maintaining schema validity and executable call rates.

Conclusion: Provides a practical blueprint for building fast, inexpensive, and reliable agents that default to SLMs while using targeted LLM assistance for specific cases like open-domain reasoning and long-horizon planning.

Abstract: Small language models (SLMs; 1-12B params, sometimes up to 20B) are sufficient and often superior for agentic workloads where the objective is schema- and API-constrained accuracy rather than open-ended generation. We synthesize recent evidence across open and proprietary SLMs (Phi-4-Mini, Qwen-2.5-7B, Gemma-2-9B, Llama-3.2-1B/3B, Ministral-3B/8B, Apple on-device 3B, DeepSeek-R1-Distill) and connect it to modern evaluations (BFCL v3/v4, StableToolBench) and serving stacks (vLLM, SGLang, TensorRT-LLM) paired with guided decoding libraries (XGrammar, Outlines). We formalize SLM-default, LLM-fallback systems with uncertainty-aware routing and verifier cascades, and propose engineering metrics that reflect real production goals: cost per successful task (CPS), schema validity rate, executable call rate, p50/p95 latency, and energy per request. Guided decoding, strict JSON Schema outputs, and validator-first tool execution close much of the capability gap with larger models and often let SLMs match or surpass LLMs on tool use, function calling, and RAG at 10x-100x lower token cost with materially better latency and energy. We provide design patterns for agent stacks that prioritize SLMs: schema-first prompting, type-safe function registries, confidence scoring with verifier rollups, and lightweight adaptation via LoRA/QLoRA. We also delineate limits where fallback remains valuable (open-domain reasoning and some long-horizon planning). The result is a practical blueprint for building fast, inexpensive, and reliable agents that default to SLMs while preserving headroom with targeted LLM assistance. Keywords: small language models, agents, function calling, structured outputs, JSON Schema, guided decoding, LoRA/QLoRA, routing, energy efficiency, edge inference

[523] Algorithm Generation via Creative Ideation

Ruiying Ma, Chieh-Jan Mike Liang, Yanjie Gao, Francis Y. Yan

Main category: cs.AI

TL;DR: MetaMuse is a framework that uses LLMs with self-reflection principles to generate creative algorithm designs, achieving significant performance improvements in cache replacement and online bin packing problems.

DetailsMotivation: LLMs are biased towards generic designs and struggle with creative leaps needed for algorithm generation in discontinuous solution spaces, limiting their practical use in system algorithm design.

Method: MetaMuse introduces three self-reflection principles: (1) quantifying diversity in performance space, (2) steering ideation through external stimuli, and (3) constructing solutions using waypoint reasoning instead of free-form chain-of-thought.

Result: MetaMuse reduced cache misses by up to 35.76% in cache replacement and reduced bin usage by up to 30.93% in online bin packing at a global cloud provider.

Conclusion: The framework successfully addresses LLM limitations in creative algorithm generation and demonstrates practical value in real-world system optimization problems.

Abstract: Designing system algorithms remains challenging, where the discontinuous nature of the solution space often forces system engineers to rely on generic heuristics at the expense of performance. We study whether LLMs can practically drive algorithm generation, and find that they are biased towards well-known generic designs, rather than making the creative leaps needed to navigate the discontinuous solution space. To address this limitation, we introduce MetaMuse, a framework for creative ideation built on three self-reflection principles: (1) quantifying solution diversity and usefulness in measurable performance space, rather than abstract idea space, (2) steering ideation through external stimuli, rather than internal randomness, and (3) constructing executable solutions using waypoint reasoning, rather than free-form chain-of-thought. Extensive evaluation shows that MetaMuse can generate high-performing solutions for two critical problems at a global cloud provider: cache replacement (reducing cache misses by up to 35.76%) and online bin packing (reducing bin usage by up to 30.93%).

[524] LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation

Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, Saravan Rajmohan

Main category: cs.AI

TL;DR: LEGOMem is a modular procedural memory framework that decomposes task trajectories into reusable memory units for multi-agent LLM systems in workflow automation, improving planning and execution through flexible memory allocation.

DetailsMotivation: To enhance multi-agent LLM systems in workflow automation by providing a systematic approach to procedural memory that supports better task planning and execution through reusable memory units.

Method: Decomposes past task trajectories into reusable memory units and flexibly allocates them across orchestrators and task agents. Conducts systematic study of memory placement, retrieval methods, and agent benefits.

Result: Experiments on OfficeBench show orchestrator memory is critical for task decomposition/delegation, while fine-grained agent memory improves execution accuracy. Smaller models benefit substantially, narrowing performance gap with stronger agents.

Conclusion: LEGOMem serves as both a practical framework for memory-augmented agent systems and a research tool for understanding memory design in multi-agent workflow automation.

Abstract: We introduce LEGOMem, a modular procedural memory framework for multi-agent large language model (LLM) systems in workflow automation. LEGOMem decomposes past task trajectories into reusable memory units and flexibly allocates them across orchestrators and task agents to support planning and execution. To explore the design space of memory in multi-agent systems, we use LEGOMem as a lens and conduct a systematic study of procedural memory in multi-agent systems, examining where memory should be placed, how it should be retrieved, and which agents benefit most. Experiments on the OfficeBench benchmark show that orchestrator memory is critical for effective task decomposition and delegation, while fine-grained agent memory improves execution accuracy. We find that even teams composed of smaller language models can benefit substantially from procedural memory, narrowing the performance gap with stronger agents by leveraging prior execution traces for more accurate planning and tool use. These results position LEGOMem as both a practical framework for memory-augmented agent systems and a research tool for understanding memory design in multi-agent workflow automation.

[525] Adaptive and Explainable AI Agents for Anomaly Detection in Critical IoT Infrastructure using LLM-Enhanced Contextual Reasoning

Raghav Sharma, Manan Mehta

Main category: cs.AI

TL;DR: This paper proposes an LLM-enhanced contextual reasoning approach with XAI agents for anomaly detection in critical IoT systems, showing superior performance over traditional methods in accuracy and interpretability.

DetailsMotivation: Traditional anomaly detection methods struggle with dynamic, high-dimensional IoT environments where data is incomplete, messy, and evolving. There's a need for adaptive, intelligent systems that can continuously improve and provide transparent reasoning.

Method: Uses LLM-supported contextual reasoning with XAI agents, employing attention mechanisms, memory buffers with meaning, and avoiding detailed time-step processing to discover hidden patterns and inconsistencies in data streams.

Result: The proposed approach significantly outperforms existing models in both detection accuracy and interpretability across smart grid and healthcare IoT simulations, with improved response speed and reduced false positives.

Conclusion: The LLM-enhanced contextual reasoning method with XAI agents shows strong potential as a future solution for anomaly detection in IoT systems, offering better accuracy, transparency, and adaptability than traditional approaches.

Abstract: Ensuring that critical IoT systems function safely and smoothly depends a lot on finding anomalies quickly. As more complex systems, like smart healthcare, energy grids and industrial automation, appear, it is easier to see the shortcomings of older methods of detection. Monitoring failures usually happen in dynamic, high dimensional situations, especially when data is incomplete, messy or always evolving. Such limits point out the requirement for adaptive, intelligent systems that always improve and think. LLMs are now capable of significantly changing how context is understood and semantic inference is done across all types of data. This proposal suggests using an LLM supported contextual reasoning method along with XAI agents to improve how anomalies are found in significant IoT environments. To discover hidden patterns and notice inconsistencies in data streams, it uses attention methods, avoids dealing with details from every time step and uses memory buffers with meaning. Because no code AI stresses transparency and interpretability, people can check and accept the AI’s decisions, helping ensure AI follows company policies. The two architectures are put together in a test that compares the results of the traditional model with those of the suggested LLM enhanced model. Important measures to check are the accuracy of detection, how much inaccurate information is included in the results, how clearly the findings can be read and how fast the system responds under different test situations. The metaheuristic is tested in simulations of real world smart grid and healthcare contexts to check its adaptability and reliability. From the study, we see that the new approach performs much better than most existing models in both accuracy and interpretation, so it could be a good fit for future anomaly detection tasks in IoT

[526] Video Game Level Design as a Multi-Agent Reinforcement Learning Problem

Sam Earle, Zehua Jiang, Eugene Vinitsky, Julian Togelius

Main category: cs.AI

TL;DR: Multi-agent PCGRL improves efficiency and generalization for level generation by distributing the task across multiple agents, reducing reward calculations and enabling better handling of diverse map shapes.

DetailsMotivation: Single-agent PCGRL faces efficiency bottlenecks from frequent reward recalculations and difficulty navigating large maps, limiting scalability and generalization.

Method: Frame level generation as a multi-agent problem where multiple agents work collaboratively, reducing the number of reward calculations relative to actions and learning more local, modular design policies.

Result: Multi-agent level generators show improved efficiency by reducing reward calculation overhead and better generalization to out-of-distribution map shapes compared to single-agent approaches.

Conclusion: Treating content generation as a distributed, multi-agent task is beneficial for generating functional artifacts at scale, offering efficiency gains and improved generalization capabilities.

Abstract: Procedural Content Generation via Reinforcement Learning (PCGRL) offers a method for training controllable level designer agents without the need for human datasets, using metrics that serve as proxies for level quality as rewards. Existing PCGRL research focuses on single generator agents, but are bottlenecked by the need to frequently recalculate heuristics of level quality and the agent’s need to navigate around potentially large maps. By framing level generation as a multi-agent problem, we mitigate the efficiency bottleneck of single-agent PCGRL by reducing the number of reward calculations relative to the number of agent actions. We also find that multi-agent level generators are better able to generalize to out-of-distribution map shapes, which we argue is due to the generators’ learning more local, modular design policies. We conclude that treating content generation as a distributed, multi-agent task is beneficial for generating functional artifacts at scale.

[527] Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation

Arina Kharlamova, Bowei He, Chen Ma, Xue Liu

Main category: cs.AI

TL;DR: Spatial CAPTCHA is a new human-verification framework that uses spatial reasoning tasks (geometric reasoning, perspective-taking, occlusion handling, mental rotation) which are intuitive for humans but challenging for state-of-the-art MLLMs.

DetailsMotivation: Conventional CAPTCHAs focusing on text recognition or 2D image understanding have become vulnerable to modern multi-modal large language models (MLLMs), necessitating a new approach that leverages fundamental differences in spatial reasoning capabilities between humans and AI.

Method: The system uses a procedural generation pipeline with constraint-based difficulty control, automated correctness verification, and human-in-the-loop validation to create dynamic questions requiring spatial reasoning skills.

Result: Evaluation on Spatial-CAPTCHA-Bench shows humans vastly outperform 10 state-of-the-art MLLMs, with the best model achieving only 31.0% Pass@1 accuracy. The framework also proves effective compared to Google reCAPTCHA.

Conclusion: Spatial CAPTCHA provides an effective security mechanism against automated abuse and serves as a diagnostic tool for assessing spatial reasoning capabilities in AI systems, leveraging human strengths in spatial cognition that current MLLMs struggle with.

Abstract: Online services rely on CAPTCHAs as a first line of defense against automated abuse, yet recent advances in multi-modal large language models (MLLMs) have eroded the effectiveness of conventional designs that focus on text recognition or 2D image understanding. To address this challenge, we present Spatial CAPTCHA, a novel human-verification framework that leverages fundamental differences in spatial reasoning between humans and MLLMs. Unlike existing CAPTCHAs which rely on low-level perception tasks that are vulnerable to modern AI, Spatial CAPTCHA generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation. These skills are intuitive for humans but difficult for state-of-the-art (SOTA) AI systems. The system employs a procedural generation pipeline with constraint-based difficulty control, automated correctness verification, and human-in-the-loop validation to ensure scalability, robustness, and adaptability. Evaluation on a corresponding benchmark, Spatial-CAPTCHA-Bench, demonstrates that humans vastly outperform 10 state-of-the-art MLLMs, with the best model achieving only 31.0% Pass@1 accuracy. Furthermore, we compare Spatial CAPTCHA with Google reCAPTCHA, which confirms its effectiveness as both a security mechanism and a diagnostic tool for spatial reasoning in AI.

[528] Where Did It All Go Wrong? A Hierarchical Look into Multi-Agent Error Attribution

Adi Banerjee, Anirudh Nair, Tarik Borogovac

Main category: cs.AI

TL;DR: ECHO is a novel algorithm for error attribution in LLM multi-agent systems that combines hierarchical context representation, objective analysis, and consensus voting to improve accuracy over existing methods.

DetailsMotivation: Current approaches to error attribution in multi-agent systems (all-at-once evaluation, step-by-step analysis, binary search) struggle with accuracy and consistency when analyzing complex patterns in interaction traces.

Method: ECHO combines hierarchical context representation with positional-based leveling, objective analysis-based evaluation, and consensus voting mechanisms to attribute errors.

Result: Experimental results show ECHO outperforms existing methods across various multi-agent interaction scenarios, particularly excelling in cases with subtle reasoning errors and complex interdependencies.

Conclusion: Structured hierarchical context representation combined with consensus-based objective decision-making provides a more robust framework for error attribution in multi-agent systems.

Abstract: Error attribution in Large Language Model (LLM) multi-agent systems presents a significant challenge in debugging and improving collaborative AI systems. Current approaches to pinpointing agent and step level failures in interaction traces - whether using all-at-once evaluation, step-by-step analysis, or binary search - fall short when analyzing complex patterns, struggling with both accuracy and consistency. We present ECHO (Error attribution through Contextual Hierarchy and Objective consensus analysis), a novel algorithm that combines hierarchical context representation, objective analysis-based evaluation, and consensus voting to improve error attribution accuracy. Our approach leverages a positional-based leveling of contextual understanding while maintaining objective evaluation criteria, ultimately reaching conclusions through a consensus mechanism. Experimental results demonstrate that ECHO outperforms existing methods across various multi-agent interaction scenarios, showing particular strength in cases involving subtle reasoning errors and complex interdependencies. Our findings suggest that leveraging these concepts of structured, hierarchical context representation combined with consensus-based objective decision-making, provides a more robust framework for error attribution in multi-agent systems.

[529] Rare Text Semantics Were Always There in Your Diffusion Transformer

Seil Kang, Woojung Han, Dayun Ju, Seong Jae Hwang

Main category: cs.AI

TL;DR: A simple method to enhance rare semantic generation in multi-modal diffusion transformers by expanding representational basins around text embeddings without additional training or external modules.

DetailsMotivation: Advanced text-to-vision models struggle with imaginative or rare prompts because these concepts are too scarce during pre-training to leave strong imprints.

Method: Mathematically expand representational basins around text token embeddings via variance scale-up before joint-attention blocks in MM-DiTs, without requiring additional training, data, or external modules.

Result: Rare semantics clearly emerge in MM-DiT’s outputs, with effective generalization across text-to-image, text-to-video, and text-driven image editing tasks.

Conclusion: The proposed intervention successfully surfaces hidden semantics that users intend, enabling generative models to better handle rare and imaginative prompts.

Abstract: Starting from flow- and diffusion-based transformers, Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation, gaining acclaim for exceptional visual fidelity. As these models advance, users continually push the boundary with imaginative or rare prompts, which advanced models still falter in generating, since their concepts are often too scarce to leave a strong imprint during pre-training. In this paper, we propose a simple yet effective intervention that surfaces rare semantics inside MM-DiTs without additional training steps, data, denoising-time optimization, or reliance on external modules (e.g., large language models). In particular, the joint-attention mechanism intrinsic to MM-DiT sequentially updates text embeddings alongside image embeddings throughout transformer blocks. We find that by mathematically expanding representational basins around text token embeddings via variance scale-up before the joint-attention blocks, rare semantics clearly emerge in MM-DiT’s outputs. Furthermore, our results generalize effectively across text-to-vision tasks, including text-to-image, text-to-video, and text-driven image editing. Our work invites generative models to reveal the semantics that users intend, once hidden yet ready to surface.

[530] Kantian-Utilitarian XAI: Meta-Explained

Zahra Atf, Peter R. Lewis

Main category: cs.AI

TL;DR: A gamified XAI system for ethical coffee purchasing that combines Kantian and utilitarian reasoning with a meta-explainer to guide consumer decisions while maintaining auditability.

DetailsMotivation: To help consumers make ethically aware decisions in coffee purchasing by providing transparent explanations that combine different ethical frameworks (Kantian and utilitarian).

Method: Six-round game with three options per round. Uses two symbolic engines: Kantian module flags rule violations (child labor, deforestation, etc.) and utilitarian module scores options via multi-criteria aggregation. Meta-explainer with regret bound (0.2) manages alignment between frameworks.

Result: Developed a complete system with structured configuration (attribute schema, certification map, weights, rule set), policy trace for auditability, and interactive UI.

Conclusion: The system successfully integrates multiple ethical frameworks in an explainable AI approach for consumer decision-making, providing both transparency and practical guidance for ethical coffee choices.

Abstract: We present a gamified explainable AI (XAI) system for ethically aware consumer decision-making in the coffee domain. Each session comprises six rounds with three options per round. Two symbolic engines provide real-time reasons: a Kantian module flags rule violations (e.g., child labor, deforestation risk without shade certification, opaque supply chains, unsafe decaf), and a utilitarian module scores options via multi-criteria aggregation over normalized attributes (price, carbon, water, transparency, farmer income share, taste/freshness, packaging, convenience). A meta-explainer with a regret bound (0.2) highlights Kantian–utilitarian (mis)alignment and switches to a deontically clean, near-parity option when welfare loss is small. We release a structured configuration (attribute schema, certification map, weights, rule set), a policy trace for auditability, and an interactive UI.

[531] Quantifying Risks in Multi-turn Conversation with Large Language Models

Chengxiao Wang, Isha Chaudhary, Qian Hu, Weitong Ruan, Rahul Gupta, Gagandeep Singh

Main category: cs.AI

TL;DR: QRLLM is a certification framework that provides statistical guarantees for bounding catastrophic risks in multi-turn LLM conversations by modeling conversations as Markov processes on query graphs.

DetailsMotivation: Existing evaluations fail to fully reveal LLM vulnerabilities due to fixed attack prompts, lack of statistical guarantees, and inability to scale to multi-turn conversation spaces.

Method: Model multi-turn conversations as probability distributions over query sequences using Markov processes on query graphs with semantic similarity edges, and quantify risks using confidence intervals with practical distributions (random node, graph path, adaptive with rejection).

Result: The framework reveals substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model.

Conclusion: There is an urgent need for improved safety training strategies in frontier LLMs given the high certified risk bounds revealed by QRLLM.

Abstract: Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose QRLLM, a novel, principled Certification framework for Catastrophic risks in multi-turn Conversation for LLMs that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.

[532] What Shapes a Creative Machine Mind? Comprehensively Benchmarking Creativity in Foundation Models

Zicong He, Boxuan Zhang, Weihao Liu, Ruixiang Tang, Lu Cheng

Main category: cs.AI

TL;DR: C^2-Eval is a holistic benchmark for unified assessment of creativity in foundation models, distinguishing between convergent (constrained) and divergent (open-ended) creativity using Usefulness, Originality, and Surprise criteria.

DetailsMotivation: Existing evaluation frameworks for creativity in foundation models are fragmented and lack theoretical grounding, despite creativity becoming increasingly recognized as a critical dimension of machine intelligence.

Method: Developed C^2-Eval benchmark that evaluates both convergent and divergent creativity using fine-grained criteria (Usefulness, Originality, Surprise) derived from social-science theory, and conducted extensive experiments on leading proprietary and open-source models.

Result: Analysis revealed trade-offs in creative capabilities of current foundation models, highlighting both strengths and challenges in pursuing creative machine intelligence.

Conclusion: C^2-Eval provides an effective framework for examining the evolving landscape of creative AI and serves as a comprehensive benchmark for assessing creativity in foundation models.

Abstract: The meteoric rise of foundation models (FMs) has expanded their capabilities far beyond conventional tasks. Creativity, long regarded as a hallmark of human intelligence and a driver of innovation, is now increasingly recognized as a critical dimension of machine intelligence in the era of generative FMs, complementing traditional measures of accuracy. However, existing evaluation frameworks for creativity remain fragmented, relying on ad hoc metrics not firmly grounded in established theories. To address this gap, we introduce C^2-Eval, a holistic benchmark for unified assessment of creativity in FMs. C^2-Eval distinguishes between two complementary forms of creativity: convergent creativity, where tasks admit constrained solutions (e.g., code generation), and divergent creativity, where tasks are open-ended (e.g., storytelling). It evaluates both dimensions using fine-grained criteria derived from social-science theory, focusing on Usefulness, Originality, and Surprise (U-O-S). Through extensive experiments on leading proprietary and open-source models, we analyze trade-offs in their creative capabilities. Our results highlight both the strengths and challenges of current FMs in pursuing a creative machine mind, showing that C^2-Eval is an effective lens for examining the evolving landscape of creative AI.

[533] Zephyrus: An Agentic Framework for Weather Science

Sumanth Varambally, Marshall Fisher, Jas Thakker, Yiwei Chen, Zhirui Xia, Yasaman Jafari, Ruijia Niu, Manas Jain, Veeramakali Vignesh Manivannan, Zachary Novack, Luyu Han, Srikar Eranky, Salva Rühling Cachay, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yi-An Ma, Rose Yu

Main category: cs.AI

TL;DR: A novel agentic framework bridges foundation weather models and LLMs by creating ZephyrusWorld environment and Zephyrus agent for interactive weather science tasks, with strong performance on new ZephyrusBench benchmark.

DetailsMotivation: Foundation weather models lack language reasoning capabilities while LLMs can't handle meteorological data, creating a gap in interactive scientific workflows.

Method: Built ZephyrusWorld environment with weather data tools and Zephyrus LLM-based agent that iteratively analyzes data through conversational feedback loops, plus ZephyrusBench benchmark with diverse weather tasks.

Result: Zephyrus agents outperform text-only baselines by up to 35 percentage points in correctness on ZephyrusBench, though performance is similar on harder tasks.

Conclusion: The framework successfully bridges weather models and LLMs, with benchmark highlighting challenges for future work on complex weather reasoning tasks.

Abstract: Foundation models for weather science are pre-trained on vast amounts of structured numerical data and outperform traditional weather forecasting systems. However, these models lack language-based reasoning capabilities, limiting their utility in interactive scientific workflows. Large language models (LLMs) excel at understanding and generating text but cannot reason about high-dimensional meteorological datasets. We bridge this gap by building a novel agentic framework for weather science. Our framework includes a Python code-based environment for agents (ZephyrusWorld) to interact with weather data, featuring tools like an interface to WeatherBench 2 dataset, geoquerying for geographical masks from natural language, weather forecasting, and climate simulation capabilities. We design Zephyrus, a multi-turn LLM-based weather agent that iteratively analyzes weather datasets, observes results, and refines its approach through conversational feedback loops. We accompany the agent with a new benchmark, ZephyrusBench, with a scalable data generation pipeline that constructs diverse question-answer pairs across weather-related tasks, from basic lookups to advanced forecasting, extreme event detection, and counterfactual reasoning. Experiments on this benchmark demonstrate the strong performance of Zephyrus agents over text-only baselines, outperforming them by up to 35 percentage points in correctness. However, on harder tasks, Zephyrus performs similarly to text-only baselines, highlighting the challenging nature of our benchmark and suggesting promising directions for future work.

[534] LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions

Mizanur Rahman, Amran Bhuiyan, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Ridwan Mahbub, Ahmed Masry, Shafiq Joty, Enamul Hoque

Main category: cs.AI

TL;DR: This survey provides the first comprehensive taxonomy of data science agents, analyzing 45 systems across the six stages of the data science lifecycle and identifying key trends, limitations, and future research directions.

DetailsMotivation: To systematically analyze and classify the emerging class of AI agents that automate data science workflows using LLMs, addressing the lack of comprehensive lifecycle-aligned taxonomies in this rapidly developing field.

Method: Developed a lifecycle-aligned taxonomy mapping 45 data science agent systems across six data science stages, with annotation along five cross-cutting design dimensions including reasoning style, modality integration, tool orchestration, learning methods, and trust mechanisms.

Result: Identified three key trends: most systems focus on exploratory analysis and modeling while neglecting business understanding and deployment; multimodal reasoning and tool orchestration remain challenging; and over 90% lack explicit trust and safety mechanisms.

Conclusion: Outlined open challenges in alignment stability, explainability, governance, and evaluation frameworks, proposing future research directions for developing robust, trustworthy, and accessible data science agents.

Abstract: Recent advances in large language models (LLMs) have enabled a new class of AI agents that automate multiple stages of the data science workflow by integrating planning, tool use, and multimodal reasoning across text, code, tables, and visuals. This survey presents the first comprehensive, lifecycle-aligned taxonomy of data science agents, systematically analyzing and mapping forty-five systems onto the six stages of the end-to-end data science process: business understanding and data acquisition, exploratory analysis and visualization, feature engineering, model building and selection, interpretation and explanation, and deployment and monitoring. In addition to lifecycle coverage, we annotate each agent along five cross-cutting design dimensions: reasoning and planning style, modality integration, tool orchestration depth, learning and alignment methods, and trust, safety, and governance mechanisms. Beyond classification, we provide a critical synthesis of agent capabilities, highlight strengths and limitations at each stage, and review emerging benchmarks and evaluation practices. Our analysis identifies three key trends: most systems emphasize exploratory analysis, visualization, and modeling while neglecting business understanding, deployment, and monitoring; multimodal reasoning and tool orchestration remain unresolved challenges; and over 90% lack explicit trust and safety mechanisms. We conclude by outlining open challenges in alignment stability, explainability, governance, and robust evaluation frameworks, and propose future research directions to guide the development of robust, trustworthy, low-latency, transparent, and broadly accessible data science agents.

[535] A global log for medical AI

Ayush Noori, Adam Rodman, Alan Karthikesalingam, Bilal A. Mateen, Christopher A. Longhurst, Daniel Yang, Dave deBronkart, Gauden Galea, Harold F. Wolf III, Jacob Waxman, Joshua C. Mandel, Juliana Rotich, Kenneth D. Mandl, Maryam Mustafa, Melissa Miles, Nigam H. Shah, Peter Lee, Robert Korom, Scott Mahoney, Seth Hain, Tien Yin Wong, Trevor Mundel, Vivek Natarajan, Noa Dagan, David A. Clifton, Ran D. Balicer, Isaac S. Kohane, Marinka Zitnik

Main category: cs.AI

TL;DR: MedLog is a syslog-inspired protocol for logging clinical AI usage events to enable transparency, performance monitoring, and safety surveillance in healthcare AI systems.

DetailsMotivation: Healthcare lacks standardized logging for clinical AI tools, making it difficult to track model usage, measure real-world performance, detect adverse events, and monitor for bias or dataset drift.

Method: MedLog defines a structured protocol with nine core fields (header, model, user, target, inputs, artifacts, outputs, outcomes, feedback) to record AI model interactions. It supports risk-based sampling, retention policies, and caching for efficiency.

Result: The protocol enables consistent logging of clinical AI activities including model invocations, user interactions, and autonomous actions, providing a foundation for analysis and monitoring.

Conclusion: MedLog can catalyze development of databases and tools for continuous surveillance, auditing, and iterative improvement of medical AI, establishing a foundation for digital epidemiology in healthcare.

Abstract: Modern computer systems often rely on syslog, a simple, universal protocol that records every critical event across heterogeneous infrastructure. However, healthcare’s rapidly growing clinical AI stack has no equivalent. As hospitals rush to pilot large language models and other AI-based clinical decision support tools, we still lack a standard way to record how, when, by whom, and for whom these AI models are used. Without that transparency and visibility, it is challenging to measure real-world performance and outcomes, detect adverse events, or correct bias or dataset drift. In the spirit of syslog, we introduce MedLog, a protocol for event-level logging of clinical AI. Any time an AI model is invoked to interact with a human, interface with another algorithm, or act independently, a MedLog record is created. This record consists of nine core fields: header, model, user, target, inputs, artifacts, outputs, outcomes, and feedback, providing a structured and consistent record of model activity. To encourage early adoption, especially in low-resource settings, and minimize the data footprint, MedLog supports risk-based sampling, lifecycle-aware retention policies, and write-behind caching; detailed traces for complex, agentic, or multi-stage workflows can also be captured under MedLog. MedLog can catalyze the development of new databases and software to store and analyze MedLog records. Realizing this vision would enable continuous surveillance, auditing, and iterative improvement of medical AI, laying the foundation for a new form of digital epidemiology.

[536] FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning

Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, Tianlong Chen

Main category: cs.AI

TL;DR: FaithCoT-Bench is the first comprehensive benchmark for detecting unfaithful Chain-of-Thought reasoning in LLMs, providing over 1,000 annotated trajectories and evaluating 11 detection methods.

DetailsMotivation: Chain-of-Thought prompting often fails to faithfully represent LLMs' internal reasoning, raising reliability concerns in high-risk applications, but existing studies lack practical methods for instance-level unfaithfulness detection.

Method: Introduced FaithCoT-Bench with rigorous task formulation for discriminative unfaithfulness detection, created FINE-CoT dataset with 1,000+ expert-annotated trajectories from 4 LLMs across 4 domains, and systematically evaluated 11 detection methods.

Result: Evaluation revealed strengths/weaknesses of existing approaches, showing increased detection challenges in knowledge-intensive domains and with more advanced models. The benchmark includes 300+ unfaithful instances with fine-grained causes.

Conclusion: FaithCoT-Bench establishes the first comprehensive benchmark for instance-level CoT faithfulness, providing a solid foundation for future research toward more interpretable and trustworthy reasoning in LLMs.

Abstract: Large language models (LLMs) increasingly rely on Chain-of-Thought (CoT) prompting to improve problem-solving and provide seemingly transparent explanations. However, growing evidence shows that CoT often fail to faithfully represent the underlying reasoning process, raising concerns about their reliability in high-risk applications. Although prior studies have focused on mechanism-level analyses showing that CoTs can be unfaithful, they leave open the practical challenge of deciding whether a specific trajectory is faithful to the internal reasoning of the model. To address this gap, we introduce FaithCoT-Bench, a unified benchmark for instance-level CoT unfaithfulness detection. Our framework establishes a rigorous task formulation that formulates unfaithfulness detection as a discriminative decision problem, and provides FINE-CoT (Faithfulness instance evaluation for Chain-of-Thought), an expert-annotated collection of over 1,000 trajectories generated by four representative LLMs across four domains, including more than 300 unfaithful instances with fine-grained causes and step-level evidence. We further conduct a systematic evaluation of eleven representative detection methods spanning counterfactual, logit-based, and LLM-as-judge paradigms, deriving empirical insights that clarify the strengths and weaknesses of existing approaches and reveal the increased challenges of detection in knowledge-intensive domains and with more advanced models. To the best of our knowledge, FaithCoT-Bench establishes the first comprehensive benchmark for instance-level CoT faithfulness, setting a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.

[537] Multi-Turn Human-LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol

Harshvardhan Mestha, Karan Bania, Shreyas V Sathyanarayana, Sidong Liu, Ashwin Srinivasan

Main category: cs.AI

TL;DR: This paper proposes a structured protocol for human-LLM interaction using finite-state machines, implementing two-way intelligibility for collaborative data analysis tasks in radiology and drug design.

DetailsMotivation: To enable more effective collaboration between human experts and LLMs on complex data analysis tasks by creating structured interaction protocols that harness human expertise and creativity.

Method: Implemented an abstract protocol based on communicating finite-state machines for human-LLM interaction, tested with controlled experiments using a human proxy (database) and uncontrolled experiments with human subjects in radiology and drug design domains.

Result: Empirical evidence supports the protocol’s capability to capture one- and two-way intelligibility in human-LLM interaction, demonstrating utility of two-way intelligibility in human-machine system design.

Conclusion: The structured protocol enables effective human-LLM collaboration, with two-way intelligibility proving valuable for designing human-machine systems that leverage both human expertise and LLM capabilities.

Abstract: Our interest is in the design of software systems involving a human-expert interacting – using natural language – with a large language model (LLM) on data analysis tasks. For complex problems, it is possible that LLMs can harness human expertise and creativity to find solutions that were otherwise elusive. On one level, this interaction takes place through multiple turns of prompts from the human and responses from the LLM. Here we investigate a more structured approach based on an abstract protocol described in [3] for interaction between agents. The protocol is motivated by a notion of “two-way intelligibility” and is modelled by a pair of communicating finite-state machines. We provide an implementation of the protocol, and provide empirical evidence of using the implementation to mediate interactions between an LLM and a human-agent in two areas of scientific interest (radiology and drug design). We conduct controlled experiments with a human proxy (a database), and uncontrolled experiments with human subjects. The results provide evidence in support of the protocol’s capability of capturing one- and two-way intelligibility in human-LLM interaction; and for the utility of two-way intelligibility in the design of human-machine systems. Our code is available at https://github.com/karannb/interact.

[538] Increasing LLM response trustworthiness using voting ensembles

Aparna Nair-Kanneganti, Trevor J. Chan, Shir Goldfinger, Emily Mackay, Brian Anthony, Alison Pouch

Main category: cs.AI

TL;DR: This paper proposes variable-threshold voting ensembles that allow LLMs to abstain when uncertain, significantly improving answer trustworthiness while maintaining reasonable response yield.

DetailsMotivation: LLMs lack reliable uncertainty quantification methods, making them untrustworthy for high-stakes applications. Current ensembling approaches need improvement to provide more trustworthy responses.

Method: Developed a theoretical framework for question answering with variable voting thresholds, allowing ensembles to abstain when dominant responses don’t meet confidence thresholds. Tested on arithmetic problem solving and clinical-note question-answering domains.

Result: Large gains in answer trustworthiness achieved with restrictive voting ensembles, with modest reductions in response yield and accuracy. Particularly effective in domains requiring high certainty.

Conclusion: Variable-threshold voting ensembles are valuable for applications like healthcare and data annotation that require high certainty but don’t need automated answers to every question.

Abstract: Despite huge advances, LLMs still lack convenient and reliable methods to quantify the uncertainty in their responses, making them difficult to trust in high-stakes applications. One of the simplest approaches to eliciting more accurate answers is to select the mode of many responses, a technique known as ensembling. In this work, we expand on typical ensembling approaches by looking at ensembles with a variable voting threshold. We introduce a theoretical framework for question answering and show that, by permitting ensembles to “abstain” from providing an answer when the dominant response falls short of the threshold, it is possible to dramatically increase the trustworthiness of the remaining answers. From this framework, we derive theoretical results as well as report experimental results on two problem domains: arithmetic problem solving and clinical-note question-answering. In both domains, we observe that large gains in answer trustworthiness can be achieved using highly restrictive voting ensembles, while incurring relatively modest reductions in response yield and accuracy. Due to this quality, voting ensembles may be particularly useful in applications - such as healthcare and data annotation - that require a high degree of certainty but which may not require that every question receive an automated answer.

[539] Toward a unified framework for data-efficient evaluation of large language models

Lele Liao, Qile Zhang, Ruofan Wu, Guanhua Fang

Main category: cs.AI

TL;DR: LEGO-IRT is a unified framework for data-efficient LLM evaluation that handles both binary and continuous metrics while leveraging structural knowledge across benchmarks.

DetailsMotivation: Current LLM evaluation is computationally expensive, and existing IRT methods are limited to binary metrics and single benchmarks, ignoring valuable structural correlations.

Method: Introduces LEGO-IRT with factorized architecture that decomposes model ability into general and structure-specific components, supporting both binary and continuous evaluation metrics.

Result: Achieves stable capability estimates using only 3% of total evaluation items, reduces estimation error by up to 10% through structural knowledge, and aligns better with human preferences.

Conclusion: LEGO-IRT provides a flexible, data-efficient framework for LLM evaluation that overcomes limitations of traditional IRT methods and leverages structural knowledge for improved accuracy.

Abstract: Evaluating large language models (LLMs) on comprehensive benchmarks is a cornerstone of their development, yet it’s often computationally and financially prohibitive. While Item Response Theory (IRT) offers a promising path toward data-efficient evaluation by disentangling model capability from item difficulty, existing IRT-based methods are hampered by significant limitations. They are typically restricted to binary correctness metrics, failing to natively handle the continuous scores used in generative tasks, and they operate on single benchmarks, ignoring valuable structural knowledge like correlations across different metrics or benchmarks. To overcome these challenges, we introduce LEGO-IRT, a unified and flexible framework for data-efficient LLM evaluation. LEGO-IRT’s novel design natively supports both binary and continuous evaluation metrics. Moreover, it introduces a factorized architecture to explicitly model and leverage structural knowledge, decomposing model ability estimates into a general component and structure-specific (e.g., per-metric or per-benchmark) components. Through extensive experiments involving $70$ LLMs across $5$ benchmarks, we show that LEGO-IRT achieves stable capability estimates using just $3%$ of the total evaluation items. We demonstrate that incorporating structural knowledge reduces estimation error by up to $10%$ and reveal that the latent abilities estimated by our framework may align more closely with human preferences.

[540] Decoding Emotion in the Deep: A Systematic Study of How LLMs Represent, Retain, and Express Emotion

Jingxiang Zhang, Lujia Zhong

Main category: cs.AI

TL;DR: This paper investigates how emotion is encoded in LLMs’ neural architecture, revealing that they develop well-defined internal emotional representations that emerge early, peak mid-network, and persist across tokens, with performance improving with model scale.

DetailsMotivation: While LLMs can simulate emotional intelligence, their internal emotional mechanisms remain largely unexplored. The research aims to understand how, where, and for how long emotion is encoded in LLMs' neural architecture.

Method: Used a novel large-scale Reddit corpus of 400,000 utterances balanced across seven basic emotions, employing lightweight probes to read information from hidden layers of Qwen3 and LLaMA models without altering parameters.

Result: LLMs develop surprisingly well-defined internal emotional geometry that sharpens with model scale and outperforms zero-shot prompting. Emotional signal emerges early, peaks mid-network, is malleable via system prompts, and persists for hundreds of subsequent tokens.

Conclusion: The study provides crucial insights for developing more transparent and aligned AI systems by mapping the emotional landscape within LLMs, with open-sourced dataset and probing toolkit.

Abstract: Large Language Models (LLMs) are increasingly expected to navigate the nuances of human emotion. While research confirms that LLMs can simulate emotional intelligence, their internal emotional mechanisms remain largely unexplored. This paper investigates the latent emotional representations within modern LLMs by asking: how, where, and for how long is emotion encoded in their neural architecture? To address this, we introduce a novel, large-scale Reddit corpus of approximately 400,000 utterances, balanced across seven basic emotions through a multi-stage process of classification, rewriting, and synthetic generation. Using this dataset, we employ lightweight “probes” to read out information from the hidden layers of various Qwen3 and LLaMA models without altering their parameters. Our findings reveal that LLMs develop a surprisingly well-defined internal geometry of emotion, which sharpens with model scale and significantly outperforms zero-shot prompting. We demonstrate that this emotional signal is not a final-layer phenomenon but emerges early and peaks mid-network. Furthermore, the internal states are both malleable (they can be influenced by simple system prompts) and persistent, as the initial emotional tone remains detectable for hundreds of subsequent tokens. We contribute our dataset, an open-source probing toolkit, and a detailed map of the emotional landscape within LLMs, offering crucial insights for developing more transparent and aligned AI systems. The code and dataset are open-sourced.

[541] Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention

Santhosh Kumar Ravindran

Main category: cs.AI

TL;DR: The Moral Anchor System (MAS) is a framework to detect, predict, and mitigate value drift in AI agents using real-time Bayesian inference, LSTM networks, and human-centric governance to maintain ethical alignment.

DetailsMotivation: As AI becomes more integrated as super-capable assistants, there are critical concerns about value alignment - ensuring AI behaviors remain consistent with human ethics and intentions, particularly addressing the risk of value drift where AI systems deviate from aligned values.

Method: MAS combines real-time Bayesian inference for monitoring value states, LSTM networks for forecasting drift, and a human-centric governance layer for adaptive interventions. It emphasizes low-latency responses (<20 ms) and reduces false positives via supervised fine-tuning with human feedback.

Result: MAS reduces value drift incidents by 80 percent or more in simulations, maintaining high detection accuracy (85 percent) and low false positive rates (0.08 post-adaptation). Experiments with goal-misaligned agents validate scalability and responsiveness.

Conclusion: MAS provides a predictive and adaptive solution for AI value alignment, contrasting static methods, with cross-domain applicability and open-source implementation for replication.

Abstract: The rise of artificial intelligence (AI) as super-capable assistants has transformed productivity and decision-making across domains. Yet, this integration raises critical concerns about value alignment - ensuring AI behaviors remain consistent with human ethics and intentions. A key risk is value drift, where AI systems deviate from aligned values due to evolving contexts, learning dynamics, or unintended optimizations, potentially leading to inefficiencies or ethical breaches. We propose the Moral Anchor System (MAS), a novel framework to detect, predict, and mitigate value drift in AI agents. MAS combines real-time Bayesian inference for monitoring value states, LSTM networks for forecasting drift, and a human-centric governance layer for adaptive interventions. It emphasizes low-latency responses (<20 ms) to prevent breaches, while reducing false positives and alert fatigue via supervised fine-tuning with human feedback. Our hypothesis: integrating probabilistic drift detection, predictive analytics, and adaptive governance can reduce value drift incidents by 80 percent or more in simulations, maintaining high detection accuracy (85 percent) and low false positive rates (0.08 post-adaptation). Rigorous experiments with goal-misaligned agents validate MAS’s scalability and responsiveness. MAS’s originality lies in its predictive and adaptive nature, contrasting static alignment methods. Contributions include: (1) MAS architecture for AI integration; (2) empirical results prioritizing speed and usability; (3) cross-domain applicability insights; and (4) open-source code for replication.

[542] SPOGW: a Score-based Preference Optimization method via Group-Wise comparison for workflows

Yitong Cui, Liu Liu, Baosheng Yu, Jiayan Qiu, Xikai Zhang, Likang Xiao, Yixing Liu, Quan Chen

Main category: cs.AI

TL;DR: SPOGW is a score-based preference approach that automates agentic workflow optimization using continuous optimization with group-wise comparisons, overcoming limitations of discrete methods.

DetailsMotivation: Current automated workflow optimization approaches suffer from limited representational capacity, insufficient adaptability, weak scalability, and pairwise comparison paradigms due to dependence on discrete optimization techniques.

Method: SPOGW uses Iterative offline GRPO (ioGRPO) with advantage-masked KL divergence (mKL) to operate on cardinal reward signals through group-wise comparison, enabling efficient optimization in continuous space with emphasis on advantageous policy regions.

Result: SPOGW matches or exceeds state-of-the-art performance across five benchmark datasets covering mathematical reasoning, coding, and question answering tasks.

Conclusion: SPOGW presents a viable and forward-looking methodology for automated generation and optimization of agentic workflows, addressing key limitations of current approaches.

Abstract: Large language models (LLMs) have exhibited significant capabilities in addressing challenging problems throughout various fields, often through the use of agentic workflows that adhere to structured instructions and multi-step procedures. However, designing such workflows demands substantial manual effort, posing challenges to scalability and generalizability. Recent studies have aimed to minimize the human intervention needed for their construction, leading to advances in automated techniques for optimizing agentic workflows. However, current approaches are often constrained by their limited representational capacity, insufficient adaptability, weak scalability, and pairwise comparison paradigm – issues that stem primarily from a dependence on discrete optimization techniques. To overcome these limitations, we introduce a new score-based preference approach, refereed as SPOGW, which operates directly on cardinal reward signals through group-wise comparison and enables more efficient and stable optimization in a continuous space. SPOGW incorporates Iterative offline GRPO (ioGRPO) with advantage-masked KL divergence (mKL), which regulates training update by placing greater emphasis on the advantageous regions of the policy response. In five benchmark datasets covering mathematical reasoning, coding, and question answering, SPOGW matches or exceeds the performance of current state-of-the-art approaches, presenting a viable and forward-looking methodology for automated generation and optimization of agentic workflows.

[543] Harnessing LLM for Noise-Robust Cognitive Diagnosis in Web-Based Intelligent Education Systems

Guixian Zhang, Guan Yuan, Ziqi Xu, Yanmei Zhang, Zhenyun Deng, Debo Cheng

Main category: cs.AI

TL;DR: DLLM is a Diffusion-based LLM framework that addresses noise and data imbalance issues in cognitive diagnosis for web-based education systems by combining graph-based structural representations with LLM semantic knowledge through a two-stage denoising diffusion process.

DetailsMotivation: Traditional cognitive diagnosis struggles with noisy, imbalanced data in web-based education systems. LLMs alone are insufficient due to their difficulty with structured data and sensitivity to noise, especially in open environments with continuous new student enrollment.

Method: DLLM constructs independent subgraphs based on response correctness, applies relation augmentation alignment, fuses representations with LLM-derived semantic knowledge, and uses a two-stage denoising diffusion module (unconditional then conditional) to remove noise and align structural representations.

Result: Experiments on three web-based educational datasets show DLLM achieves optimal predictive performance across varying noise levels, demonstrating noise robustness while effectively leveraging LLM semantic knowledge.

Conclusion: DLLM successfully addresses noise and data imbalance challenges in cognitive diagnosis by integrating structural and semantic representations through diffusion-based denoising, outperforming existing methods in noisy educational environments.

Abstract: Cognitive diagnostics in the Web-based Intelligent Education System (WIES) aims to assess students’ mastery of knowledge concepts from heterogeneous, noisy interactions. Recent work has tried to utilize Large Language Models (LLMs) for cognitive diagnosis, yet LLMs struggle with structured data and are prone to noise-induced misjudgments. Specially, WIES’s open environment continuously attracts new students and produces vast amounts of response logs, exacerbating the data imbalance and noise issues inherent in traditional educational systems. To address these challenges, we propose DLLM, a Diffusion-based LLM framework for noise-robust cognitive diagnosis. DLLM first constructs independent subgraphs based on response correctness, then applies relation augmentation alignment module to mitigate data imbalance. The two subgraph representations are then fused and aligned with LLM-derived, semantically augmented representations. Importantly, before each alignment step, DLLM employs a two-stage denoising diffusion module to eliminate intrinsic noise while assisting structural representation alignment. Specifically, unconditional denoising diffusion first removes erroneous information, followed by conditional denoising diffusion based on graph-guided to eliminate misleading information. Finally, the noise-robust representation that integrates semantic knowledge and structural information is fed into existing cognitive diagnosis models for prediction. Experimental results on three publicly available web-based educational platform datasets demonstrate that our DLLM achieves optimal predictive performance across varying noise levels, which demonstrates that DLLM achieves noise robustness while effectively leveraging semantic knowledge from LLM.

[544] WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning

Peichao Lai, Jinhui Zhuang, Kexuan Zhang, Ningchang Xiong, Shengjie Wang, Yanwei Xu, Chong Chen, Yilei Wang, Bin Cui

Main category: cs.AI

TL;DR: WebRenderBench is a large-scale benchmark for WebUI-to-Code conversion with 22.5k real-world webpages, featuring a novel evaluation metric for layout/style consistency and ALISA agent that uses this metric in reinforcement learning to achieve state-of-the-art performance.

DetailsMotivation: Existing benchmarks for converting UI images to web code lack diversity and reliable evaluation methods. Current vision-based methods are costly and structure-based comparisons are noisy and asymmetric.

Method: Created WebRenderBench with 22.5k diverse real-world webpages, proposed a novel evaluation metric for layout/style consistency from rendered pages, and developed ALISA agent that integrates this metric into reinforcement learning as reward signal.

Result: ALISA significantly boosts generation performance and achieves state-of-the-art results across multiple metrics compared to existing methods.

Conclusion: The proposed WebRenderBench benchmark and ALISA agent with novel evaluation metric provide more efficient, objective, and reliable UI quality assessment for WebUI-to-Code conversion tasks.

Abstract: Automating the conversion of UI images into web code is a critical task for front-end development and rapid prototyping. Advances in multimodal large language models (MLLMs) have made WebUI-to-Code increasingly feasible, yet existing benchmarks remain limited in data diversity and evaluation reliability. To address these issues, we present WebRenderBench, a large-scale benchmark of 22.5k webpages collected from real-world portal sites, offering greater diversity, complexity, and realism than prior benchmarks. We further propose a novel evaluation metric that measures layout and style consistency from the final rendered pages. Unlike vision-based methods that rely on costly LLM reasoning or structure-based comparisons vulnerable to noise and asymmetry, our approach enables more efficient, objective, and reliable UI quality assessment. Finally, we introduce the Automated Layout and Style Inspection Agent (ALISA), which integrates this metric into reinforcement learning as a reward signal to enhance training on crawled asymmetric webpages. Experiments show that ALISA significantly boosts generation performance, achieving state-of-the-art results across multiple metrics.

[545] Searching Meta Reasoning Skeleton to Guide LLM Reasoning

Ziying Zhang, Yaqing Wang, Quanming Yao

Main category: cs.AI

TL;DR: AutoMR automatically searches for query-aware meta reasoning skeletons using DAG representations and AutoML-inspired techniques, improving LLM reasoning performance over manual approaches.

DetailsMotivation: Prior works used manually designed meta reasoning skeletons, which limit adaptability to query-specific requirements and cannot capture complex logical dependencies among reasoning steps.

Method: Represent meta reasoning skeletons as directed acyclic graphs (DAGs), construct a search space, formulate the search problem, and use dynamic skeleton sampling that expands skeletons along with reasoning context during inference.

Result: AutoMR achieves better reasoning performance than previous works across extensive benchmark datasets.

Conclusion: Automated search for query-aware meta reasoning skeletons using DAG representations and dynamic sampling enables more adaptive and effective reasoning in large language models.

Abstract: Meta reasoning behaviors work as a skeleton to guide large language model (LLM) reasoning, thus help to improve reasoning performance. However, prior researches implement meta reasoning skeleton with manually designed structure, limiting ability to adapt to query-specific requirement and capture intricate logical dependency among reasoning steps. To deal with the challenges, we represent meta reasoning skeleton with directed acyclic graph (DAG) to unify skeletons proposed in prior works and model intricate logical dependency. Then we propose AutoMR, a framework that searches for query-aware meta reasoning skeleton automatically inspired by automated machine learning (AutoML). Specifically, we construct search space based on DAG representation of skeleton and then formulate the search problem. We design a dynamic skeleton sampling algorithm by expanding meta reasoning skeleton along with reasoning context at inference time. This algorithm can derive any meta reasoning skeleton in search space efficiently and adapt skeleton to evolving base reasoning context, thus enable efficient query-aware skeleton search. We conduct experiments on extensive benchmark datasets. Experimental results show that AutoMR achieves better reasoning performance than previous works broadly.

[546] Internal states before wait modulate reasoning patterns

Dmitrii Troitskii, Koyena Pal, Chris Wendler, Callum Stuart McDougall, Neel Nanda

Main category: cs.AI

TL;DR: The paper investigates whether model latents before wait tokens contain information that modulates reasoning processes, using crosscoders and latent attribution to identify features that influence wait token probabilities and enable different reasoning patterns.

DetailsMotivation: Little is understood about why models decide to reason using wait tokens (signaling behaviors like backtracking), which limits understanding of what makes reasoning models effective.

Method: Train crosscoders at multiple layers of DeepSeek-R1-Distill-Llama-8B and its base version, introduce latent attribution technique in crosscoder setting to locate features relevant for promoting/suppressing wait token probabilities.

Result: Identified a small set of features relevant for wait token probabilities; through max activating examples and causal interventions, showed these features are indeed relevant for reasoning and give rise to patterns like restarting, recalling prior knowledge, expressing uncertainty, and double-checking.

Conclusion: Model latents preceding wait tokens contain relevant information that modulates subsequent reasoning processes, enabling different reasoning patterns.

Abstract: Prior work has shown that a significant driver of performance in reasoning models is their ability to reason and self-correct. A distinctive marker in these reasoning traces is the token wait, which often signals reasoning behavior such as backtracking. Despite being such a complex behavior, little is understood of exactly why models do or do not decide to reason in this particular manner, which limits our understanding of what makes a reasoning model so effective. In this work, we address the question whether model’s latents preceding wait tokens contain relevant information for modulating the subsequent reasoning process. We train crosscoders at multiple layers of DeepSeek-R1-Distill-Llama-8B and its base version, and introduce a latent attribution technique in the crosscoder setting. We locate a small set of features relevant for promoting/suppressing wait tokens’ probabilities. Finally, through a targeted series of experiments analyzing max activating examples and causal interventions, we show that many of our identified features indeed are relevant for the reasoning process and give rise to different types of reasoning patterns such as restarting from the beginning, recalling prior knowledge, expressing uncertainty, and double-checking.

[547] Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs

Zishang Jiang, Jinyi Han, Tingyun Li, Xinyi Wang, Sihang Jiang, Jiaqing Liang, Zhaoqian Dai, Shuguang Ma, Fei Yu, Yanghua Xiao

Main category: cs.AI

TL;DR: MENTOR is a framework that provides expert guidance only at critical decision points in RLVR to enable effective and diverse exploration, overcoming limitations of existing methods that rely on full expert trajectory imitation.

DetailsMotivation: Existing RLVR methods depend heavily on base model capability and address this by imitating expert trajectories, which improves effectiveness but neglects diversity in exploration.

Method: Proposed MENTOR framework that provides expert guidance only at critical decision points rather than entire reasoning paths, enabling mixed-policy expert navigation for token-level optimization.

Result: Extensive experiments show MENTOR enables models to capture the essence of expert strategies rather than surface imitation, performing high-quality exploration and achieving superior overall performance.

Conclusion: MENTOR effectively addresses the exploration quality issue in RLVR by providing targeted expert guidance at critical points, leading to better reasoning capability without full trajectory imitation.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of base models. This issue arises because it requires the model to have sufficient capability to perform high-quality exploration, which involves both effectiveness and diversity. Unfortunately, existing methods address this issue by imitating expert trajectories, which improve effectiveness but neglect diversity. To address this, we argue that the expert only needs to provide guidance only at critical decision points rather than the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points to perform effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models capture the essence of expert strategies rather than surface imitation, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.

[548] The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation from Recognition to Reasoning

Mayank Ravishankara, Varindra V. Persad Maharaj

Main category: cs.AI

TL;DR: Survey of multimodal AI evaluation evolution from simple recognition tasks to complex reasoning benchmarks, examining the paradigm shift towards testing deeper understanding and reasoning processes.

DetailsMotivation: To document the progression of multimodal AI evaluation as increasingly sophisticated cognitive examinations, driven by the saturation of older benchmarks where high performance masks fundamental weaknesses.

Method: Chronicles the evolution through three stages: (1) foundational knowledge tests like ImageNet, (2) applied logic and comprehension exams like GQA and VCR, and (3) expert-level integration benchmarks like MMBench and SEED-Bench for modern MLLMs.

Result: Identifies a clear paradigm shift from testing “what” models see to probing “why” and “how” they understand, with current benchmarks focusing on reasoning processes and future directions exploring abstract, creative, and social intelligence.

Conclusion: AI evaluation is a continuous adversarial process of designing better examinations that redefine goals for creating truly intelligent systems, not just a history of datasets.

Abstract: This survey paper chronicles the evolution of evaluation in multimodal artificial intelligence (AI), framing it as a progression of increasingly sophisticated “cognitive examinations.” We argue that the field is undergoing a paradigm shift, moving from simple recognition tasks that test “what” a model sees, to complex reasoning benchmarks that probe “why” and “how” it understands. This evolution is driven by the saturation of older benchmarks, where high performance often masks fundamental weaknesses. We chart the journey from the foundational “knowledge tests” of the ImageNet era to the “applied logic and comprehension” exams such as GQA and Visual Commonsense Reasoning (VCR), which were designed specifically to diagnose systemic flaws such as shortcut learning and failures in compositional generalization. We then survey the current frontier of “expert-level integration” benchmarks (e.g., MMBench, SEED-Bench, MMMU) designed for today’s powerful multimodal large language models (MLLMs), which increasingly evaluate the reasoning process itself. Finally, we explore the uncharted territories of evaluating abstract, creative, and social intelligence. We conclude that the narrative of AI evaluation is not merely a history of datasets, but a continuous, adversarial process of designing better examinations that, in turn, redefine our goals for creating truly intelligent systems.

[549] Open Agent Specification (Agent Spec) Technical Report

Yassine Benajiba, Cesare Bernardis, Vladislav Blinov, Paul Cayet, Hassan Chafi, Abderrahim Fathan, Louis Faucon, Damien Hilloulin, Sungpack Hong, Ingo Kossyk, Rhicheek Patra, Sujith Ravi, Jonas Schweizer, Jyotika Singh, Shailender Singh, Xuelin Situ, Weiyi Sun, Jerry Xu, Ying Xu

Main category: cs.AI

TL;DR: Open Agent Specification (Agent Spec) is a declarative language for defining AI agents and workflows that enables cross-framework compatibility, portability, and interoperability.

DetailsMotivation: To address fragmented agent development by providing a common unified specification that allows AI agents to be designed once and deployed across various frameworks, reducing redundant development efforts.

Method: Agent Spec uses a declarative language approach that allows AI agents to be defined independently of their execution environment, serving as an interchange format between different AI frameworks and tools.

Result: Agent Spec benefits four key groups: developers gain reusable components, framework developers get an interchange format, researchers achieve reproducible results, and enterprises benefit from faster deployment and scalability.

Conclusion: Agent Spec provides technical foundations for improving AI agent interoperability, reusability, and portability across different frameworks, with ongoing future developments.

Abstract: Open Agent Specification (Agent Spec) is a declarative language that allows AI agents and their workflows to be defined in a way that is compatible across different AI frameworks, promoting portability and interoperability within AI Agent frameworks. Agent Spec aims to resolve the challenges of fragmented agent development by providing a common unified specification that allows AI agents to be designed once and deployed across various frameworks, improving interoperability and reusability, and reducing redundant development efforts. Additionally, Agent Spec facilitates development tools and portability, allowing AI agents to be defined independently of their execution environment and enabling teams to exchange solutions without implementation-specific limitations. Agent Spec benefits four key groups: (i) Agent developers, who gain access to a superset of reusable components and design patterns, enabling them to leverage a broader range of functionalities; (ii) Agent framework and tool developers, who can use Agent Spec as an interchange format and therefore benefit from the support of other frameworks as well as other tools; (iii) Researchers, who can achieve reproducible results and comparability, facilitating more reliable and consistent outcomes; (iv) Enterprises, which benefit from faster prototype-to-deployment, increased productivity, as well as greater scalability and maintainability for their AI agent solutions. This technical report provides an overview of the technical foundations of Agent Spec, including motivation, benefits, and future developments.

[550] Constructing coherent spatial memory in LLM agents through graph rectification

Puzhen Zhang, Xuyang Chen, Yu Feng, Yuhan Jiang, Liqiu Meng

Main category: cs.AI

TL;DR: The paper proposes an LLM-driven framework for incremental map construction and repair that handles structural inconsistencies in navigation graphs through version control and edge impact scoring.

DetailsMotivation: LLMs can infer spatial layouts from navigation instructions but struggle with long environments, requiring incremental map construction that needs to handle structural inconsistencies.

Method: Uses Version Control to track graph edit history and Edge Impact Score to prioritize minimal-cost repairs based on structural reachability, path usage, and conflict propagation.

Result: Significantly improves map correctness and robustness, especially for entangled or chained inconsistencies, using a refined MANGO benchmark dataset.

Conclusion: History-aware repair mechanisms are crucial for maintaining coherent spatial memory in LLM agents, enabling effective incremental map construction and repair.

Abstract: Given a map description through global traversal navigation instructions (e.g., visiting each room sequentially with action signals such as north, west, etc.), an LLM can often infer the implicit spatial layout of the environment and answer user queries by providing a shortest path from a start to a destination (for instance, navigating from the lobby to a meeting room via the hall and elevator). However, such context-dependent querying becomes incapable as the environment grows much longer, motivating the need for incremental map construction that builds a complete topological graph from stepwise observations. We propose a framework for LLM-driven construction and map repair, designed to detect, localize, and correct structural inconsistencies in incrementally constructed navigation graphs. Central to our method is the Version Control, which records the full history of graph edits and their source observations, enabling fine-grained rollback, conflict tracing, and repair evaluation. We further introduce an Edge Impact Score to prioritize minimal-cost repairs based on structural reachability, path usage, and conflict propagation. To properly evaluate our approach, we create a refined version of the MANGO benchmark dataset by systematically removing non-topological actions and inherent structural conflicts, providing a cleaner testbed for LLM-driven construction and map repair. Our approach significantly improves map correctness and robustness, especially in scenarios with entangled or chained inconsistencies. Our results highlight the importance of introspective, history-aware repair mechanisms for maintaining coherent spatial memory in LLM agents.

[551] COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao

Main category: cs.AI

TL;DR: COSMO-RL is a reinforcement learning framework that trains multimodal reasoning models to improve safety while maintaining reasoning capabilities, reducing jailbreak vulnerabilities and unnecessary refusals.

DetailsMotivation: Safety is challenging in multimodal settings where images and text can bypass guardrails, and single objective training can cause policy drift leading to over-refusal or unsafe compliance.

Method: COSMO-RL uses mixed reinforcement learning with multimodal, multitask, and multiobjective signals to train reasoning-oriented LMRMs, allowing safety and capability to grow together.

Result: COSMO-R1 improves safety while maintaining or improving multimodal reasoning and instruction following, shows stronger robustness to multimodal jailbreaks, and reduces unnecessary refusals.

Conclusion: The framework provides a simple path to advancing both safety and general capability together in Large Multimodal Reasoning Models, with consistent gains across different backbones.

Abstract: Large Multimodal Reasoning Models (LMRMs) are moving into real applications, where they must be both useful and safe. Safety is especially challenging in multimodal settings: images and text can be combined to bypass guardrails, and single objective training can cause policy drift that yields over-refusal on benign inputs or unsafe compliance on risky ones. We present COSMO-RL, a mixed reinforcement learning framework that trains reasoning oriented LMRMs under multimodal, multitask, and multiobjective signals, and we release the resulting model, COSMO-R1. Our approach aims to let safety and capability grow together in one stable pipeline rather than competing during alignment. In experiments, COSMO-R1 improves safety while maintaining-and often improving multimodal reasoning and instruction following, shows stronger robustness to multimodal jailbreaks, and reduces unnecessary refusals. The framework also transfers across backbones with consistent gains. Ablations support the design choices, indicating a simple path to advancing safety and general capability together in LMRMs.

[552] AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework

Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, Rui Lu, Hongning Wang, Jie Tang, Yuxiao Dong

Main category: cs.AI

TL;DR: AgentRL is a scalable framework for multi-turn, multi-task reinforcement learning training of LLM agents, featuring asynchronous infrastructure and stable training algorithms that outperform major LLM models.

DetailsMotivation: There is growing interest in building generalist agents through online interactions, but applying RL to train LLM agents in multi-turn, multi-task settings remains challenging due to infrastructure limitations and unstable training algorithms.

Method: AgentRL uses a fully-asynchronous generation-training pipeline, unified function-call API, containerized environments, and centralized controller for infrastructure. Algorithmically, it employs cross-policy sampling for exploration and task advantage normalization for stability.

Result: AgentRL significantly outperforms GPT-5, Clause-Sonnet-4, DeepSeek-R1, and other open-source LLM agents across five agentic tasks. Multi-task training matches the best results among all task-specific models.

Conclusion: AgentRL provides an effective framework for scalable multi-turn, multi-task RL training of LLM agents, demonstrating superior performance and stability compared to existing approaches.

Abstract: Recent advances in large language models (LLMs) have sparked growing interest in building generalist agents that can learn through online interactions. However, applying reinforcement learning (RL) to train LLM agents in multi-turn, multi-task settings remains challenging due to lack of scalable infrastructure and stable training algorithms. In this work, we present the AgentRL framework for scalable multi-turn, multi-task agentic RL training. On the infrastructure side, AgentRL features a fully-asynchronous generation-training pipeline for efficient multi-turn RL. To support heterogeneous environment development in multi-task RL, we design a unified function-call based API interface, containerized environment development, and a centralized controller. On the algorithm side, we propose cross-policy sampling to encourage model exploration in multi-turn settings and task advantage normalization to stabilize multi-task training. Experiments show that AgentRL, trained on open LLMs across five agentic tasks, significantly outperforms GPT-5, Clause-Sonnet-4, DeepSeek-R1, and other open-source LLM agents. Multi-task training with AgentRL matches the best results among all task-specific models. AgentRL is open-sourced at https://github.com/THUDM/AgentRL. The algorithm and framework are adopted in building \textsc{\href{https://autoglm.zhipuai.cn}{AutoGLM}}.

[553] Don’t Pass$\mathtt{@}k$: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

Main category: cs.AI

TL;DR: A Bayesian framework replaces Pass@k for LLM evaluation, providing stable rankings and uncertainty estimates with fewer samples.

DetailsMotivation: Pass@k yields unstable, misleading rankings when trials are limited and compute is constrained.

Method: Bayesian evaluation with Dirichlet prior, modeling outcomes as categorical and providing posterior estimates of success probability with credible intervals.

Result: Achieves faster convergence and greater rank stability than Pass@k, enabling reliable comparisons with smaller sample counts.

Conclusion: Recommends replacing Pass@k with posterior-based protocol that unifies binary and non-binary evaluation with explicit uncertainty.

Abstract: Pass$@k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model’s underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/‘25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://mohsenhariri.github.io/bayes-kit

[554] Closing the Loop: Coordinating Inventory and Recommendation via Deep Reinforcement Learning on Multiple Timescales

Jinyang Jiang, Jinhui Han, Yijie Peng, Ying Zhang

Main category: cs.AI

TL;DR: A multi-agent reinforcement learning framework for cross-functional coordination between inventory replenishment and product recommendation, using multi-timescale learning to improve profitability and scalability.

DetailsMotivation: To address the challenge of cross-functional coordination in complex organizations using AI/RL, specifically focusing on the interplay between inventory management and personalized recommendations.

Method: Developed a unified multi-agent RL framework with multi-timescale architecture, decomposing policies by departmental functions and assigning different learning speeds based on task complexity.

Result: The approach significantly improved profitability compared to siloed decision-making, with RL agent behaviors aligning with theoretical insights and demonstrating asymptotic convergence.

Conclusion: Provides a scalable, interpretable RL-based solution for effective cross-functional coordination in complex business environments.

Abstract: Effective cross-functional coordination is essential for enhancing firm-wide profitability, particularly in the face of growing organizational complexity and scale. Recent advances in artificial intelligence, especially in reinforcement learning (RL), offer promising avenues to address this fundamental challenge. This paper proposes a unified multi-agent RL framework tailored for joint optimization across distinct functional modules, exemplified via coordinating inventory replenishment and personalized product recommendation. We first develop an integrated theoretical model to capture the intricate interplay between these functions and derive analytical benchmarks that characterize optimal coordination. The analysis reveals synchronized adjustment patterns across products and over time, highlighting the importance of coordinated decision-making. Leveraging these insights, we design a novel multi-timescale multi-agent RL architecture that decomposes policy components according to departmental functions and assigns distinct learning speeds based on task complexity and responsiveness. Our model-free multi-agent design improves scalability and deployment flexibility, while multi-timescale updates enhance convergence stability and adaptability across heterogeneous decisions. We further establish the asymptotic convergence of the proposed algorithm. Extensive simulation experiments demonstrate that the proposed approach significantly improves profitability relative to siloed decision-making frameworks, while the behaviors of the trained RL agents align closely with the managerial insights from our theoretical model. Taken together, this work provides a scalable, interpretable RL-based solution to enable effective cross-functional coordination in complex business settings.

[555] GROK: From Quantitative Biomarkers to Qualitative Diagnosis via a Grounded MLLM with Knowledge-Guided Instruction

Zhuangzhi Gao, Hongyi Qin, He Zhao, Qinkai Yu, Feixiang Zhou, Eduard Shantsila, Uazman Alam, Alena Shantsila, Wahbi El-Bouri, Gregory Y. H. Lip, Yalin Zheng

Main category: cs.AI

TL;DR: GROK is a grounded multimodal large language model that jointly processes color fundus photography, OCT, and text to deliver clinician-grade diagnoses of ocular and systemic disease through a quantitative-to-qualitative diagnostic chain of thought.

DetailsMotivation: Current medical MLLMs like LLaVA-Med fail to fully exploit the synergy between color fundus photography and OCT, and offer limited interpretability of quantitative biomarkers.

Method: GROK comprises three core modules: Knowledge-Guided Instruction Generation, CLIP-Style OCT-Biomarker Alignment, and Supervised Instruction Fine-Tuning, establishing a quantitative-to-qualitative diagnostic chain of thought that mirrors real clinical reasoning.

Result: With only LoRA fine-tuning of a 7B-parameter Qwen2 backbone, GROK outperforms comparable 7B and 32B baselines on both report quality and fine-grained clinical metrics, and even exceeds OpenAI o3.

Conclusion: GROK demonstrates superior performance in medical diagnosis tasks through its grounded multimodal approach and quantitative-to-qualitative reasoning chain, providing clinician-grade diagnoses with detailed lesion annotations.

Abstract: Multimodal large language models (MLLMs) hold promise for integrating diverse data modalities, but current medical adaptations such as LLaVA-Med often fail to fully exploit the synergy between color fundus photography (CFP) and optical coherence tomography (OCT), and offer limited interpretability of quantitative biomarkers. We introduce GROK, a grounded multimodal large language model that jointly processes CFP, OCT, and text to deliver clinician-grade diagnoses of ocular and systemic disease. GROK comprises three core modules: Knowledge-Guided Instruction Generation, CLIP-Style OCT-Biomarker Alignment, and Supervised Instruction Fine-Tuning, which together establish a quantitative-to-qualitative diagnostic chain of thought, mirroring real clinical reasoning when producing detailed lesion annotations. To evaluate our approach, we introduce the Grounded Ophthalmic Understanding benchmark, which covers six disease categories and three tasks: macro-level diagnostic classification, report generation quality, and fine-grained clinical assessment of the generated chain of thought. Experiments show that, with only LoRA (Low-Rank Adaptation) fine-tuning of a 7B-parameter Qwen2 backbone, GROK outperforms comparable 7B and 32B baselines on both report quality and fine-grained clinical metrics, and even exceeds OpenAI o3. Code and data are publicly available in the GROK repository.

[556] Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning

Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu

Main category: cs.AI

TL;DR: Doctor-R1 is an AI doctor agent that masters both medical decision-making and strategic empathetic consultation through multi-agent training, outperforming state-of-the-art models in clinical dialogue quality.

DetailsMotivation: Existing LLMs achieve high accuracy in medical decision-making but lack strategic, empathetic consultation skills essential for real-world clinical scenarios.

Method: Multi-agent interactive environment with two-tiered reward architecture (separately optimizing decision-making and communication) and experience repository for policy learning.

Result: Surpasses state-of-the-art open-source specialized LLMs with higher parameter efficiency, outperforms proprietary models, and shows strong human preference in clinical dialogue generation.

Conclusion: The framework effectively enables AI doctors to master both medical decision-making and empathetic consultation skills, demonstrating practical clinical value.

Abstract: The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct the strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both of the capabilities by ask high-yield questions and conduct strategic multi-turn inquiry to guide decision-making. Our framework introduces three key components: a multi-agent interactive environment, a two-tiered reward architecture that separately optimizes clinical decision-making and communicative inquiry skills, and an experience repository to ground policy learning in high-quality prior trajectories. We evaluate Doctor-R1 on OpenAI’s HealthBench and MAQuE, assessed across multi-facet metrics, such as communication quality, user experience, and task accuracy. Remarkably, Doctor-R1 surpasses state-of-the-art open-source specialized LLMs by a substantial margin with higher parameter efficiency and outperforms powerful proprietary models. Furthermore, the human evaluations show a strong preference for Doctor-R1 to generate human-preferred clinical dialogue, demonstrating the effectiveness of the framework.

[557] On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems

Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong

Main category: cs.AI

TL;DR: LLM multi-agent systems outperform single-agent systems more significantly as task complexity increases, particularly with longer reasoning chains and diverse capability requirements.

DetailsMotivation: To systematically understand when and why LLM multi-agent systems outperform single-agent systems, addressing limitations in current experimental designs.

Method: Proposed a theoretical framework with depth (reasoning length) and width (capability diversity) dimensions, then empirically evaluated multi-agent debate systems on discriminative and generative tasks.

Result: LLM-MAS benefits increase with both task depth and width, with depth having a more pronounced effect on performance improvement.

Conclusion: The framework clarifies when LLM-MAS are beneficial and provides principled guidance for designing future multi-agent systems and benchmarks.

Abstract: Large language model multi-agent systems (LLM-MAS) offer a promising paradigm for harnessing collective intelligence to achieve more advanced forms of AI behaviour. While recent studies suggest that LLM-MAS can outperform LLM single-agent systems (LLM-SAS) on certain tasks, the lack of systematic experimental designs limits the strength and generality of these conclusions. We argue that a principled understanding of task complexity, such as the degree of sequential reasoning required and the breadth of capabilities involved, is essential for assessing the effectiveness of LLM-MAS in task solving. To this end, we propose a theoretical framework characterising tasks along two dimensions: depth, representing reasoning length, and width, representing capability diversity. We theoretically examine a representative class of LLM-MAS, namely the multi-agent debate system, and empirically evaluate its performance in both discriminative and generative tasks with varying depth and width. Theoretical and empirical results show that the benefit of LLM-MAS over LLM-SAS increases with both task depth and width, and the effect is more pronounced with respect to depth. This clarifies when LLM-MAS are beneficial and provides a principled foundation for designing future LLM-MAS methods and benchmarks.

[558] Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation

Hadi Nekoei, Aman Jaiswal, Patrice Bechard, Oleh Shliazhko, Orlando Marquez Ayala, Mathieu Reymond, Massimo Caccia, Alexandre Drouin, Sarath Chandar, Alexandre Lacoste

Main category: cs.AI

TL;DR: JEF Hinter is an agentic system that distills offline trajectories into compact, context-aware hints to improve LLM agents’ performance without costly online interactions or fine-tuning.

DetailsMotivation: Current methods for improving LLM agents on unfamiliar domains require expensive online interactions or fine-tuning, which are impractical for closed-source models and risk catastrophic forgetting. Offline trajectories contain reusable knowledge but are too long, noisy, and task-specific for direct use.

Method: JEF Hinter uses a zooming mechanism to highlight decisive steps in long trajectories, distilling both successful and failed traces into compact hints. It supports parallelized hint generation and includes a retriever that selects relevant hints for the current state during inference.

Result: Experiments on MiniWoB++, WorkArena-L1, and WebArena-Lite show JEF Hinter consistently outperforms strong baselines, including human- and document-based hints.

Conclusion: JEF Hinter effectively leverages offline trajectories to provide targeted guidance for LLM agents, demonstrating improved performance across multiple benchmarks while offering transparency and traceability.

Abstract: Large language model (LLM) agents perform well in sequential decision-making tasks, but improving them on unfamiliar domains often requires costly online interactions or fine-tuning on large expert datasets. These strategies are impractical for closed-source models and expensive for open-source ones, with risks of catastrophic forgetting. Offline trajectories offer reusable knowledge, yet demonstration-based methods struggle because raw traces are long, noisy, and tied to specific tasks. We present Just-in-time Episodic Feedback Hinter (JEF Hinter), an agentic system that distills offline traces into compact, context-aware hints. A zooming mechanism highlights decisive steps in long trajectories, capturing both strategies and pitfalls. Unlike prior methods, JEF Hinter leverages both successful and failed trajectories, extracting guidance even when only failure data is available, while supporting parallelized hint generation and benchmark-independent prompting. At inference, a retriever selects relevant hints for the current state, providing targeted guidance with transparency and traceability. Experiments on MiniWoB++, WorkArena-L1, and WebArena-Lite show that JEF Hinter consistently outperforms strong baselines, including human- and document-based hints.

[559] LLM Based Bayesian Optimization for Prompt Search

Adam Ballew, Jingbo Wang, Shaogang Ren

Main category: cs.AI

TL;DR: BO-LLM algorithm uses Bayesian Optimization with LLM-powered Gaussian Process for prompt engineering to improve text classification accuracy while reducing API calls.

DetailsMotivation: To efficiently optimize expensive black-box functions (prompt engineering) with limited evaluations for enhancing text classification with Large Language Models.

Method: Uses LLM-powered Gaussian Process as surrogate model, generates prompt candidates via LLM expansion of seed prompts, evaluates with UCB acquisition function, and iteratively refines prompts based on data subset.

Result: The proposed BO-LLM algorithm was evaluated on two datasets and showed advantages in improving classification performance.

Conclusion: Bayesian Optimization with LLM-powered GP is effective for prompt engineering in text classification tasks, achieving better accuracy while minimizing API calls through uncertainty-aware optimization.

Abstract: Bayesian Optimization (BO) has been widely used to efficiently optimize expensive black-box functions with limited evaluations. In this paper, we investigate the use of BO for prompt engineering to enhance text classification with Large Language Models (LLMs). We employ an LLM-powered Gaussian Process (GP) as the surrogate model to estimate the performance of different prompt candidates. These candidates are generated by an LLM through the expansion of a set of seed prompts and are subsequently evaluated using an Upper Confidence Bound (UCB) acquisition function in conjunction with the GP posterior. The optimization process iteratively refines the prompts based on a subset of the data, aiming to improve classification accuracy while reducing the number of API calls by leveraging the prediction uncertainty of the LLM-based GP. The proposed BO-LLM algorithm is evaluated on two datasets, and its advantages are discussed in detail in this paper.

[560] Internal World Models as Imagination Networks in Cognitive Agents

Saurabh Ranjan, Brian Odegaard

Main category: cs.AI

TL;DR: The study proposes that imagination serves to access an internal world model (IWM) and uses psychological network analysis to compare IWMs in humans and LLMs, finding significant differences in network structure and centrality correlations.

DetailsMotivation: To understand the computational objective of imagination and challenge classical views that imagination is primarily for reward maximization, by exploring internal world models in both humans and AI systems.

Method: Used psychological network analysis to assess imagination vividness ratings from questionnaires, constructing imagination networks from human reports and comparing them with networks generated by large language models under different prompts and memory conditions.

Result: Human imagination networks showed correlations between centrality measures (expected influence, strength, closeness), while LLM networks lacked clustering and showed lower correlations between centrality measures across different conditions.

Conclusion: There is a lack of similarity between internal world models in humans and LLM agents, providing a novel method for comparing internally-generated representations and insights for developing human-like imagination in AI.

Abstract: What is the computational objective of imagination? While classical interpretations suggest imagination is useful for maximizing rewards, recent findings challenge this view. In this study, we propose that imagination serves to access an internal world model (IWM) and use psychological network analysis to explore IWMs in humans and large language models (LLMs). Specifically, we assessed imagination vividness ratings using two questionnaires and constructed imagination networks from these reports. Imagination networks from human groups showed correlations between different centrality measures, including expected influence, strength, and closeness. However, imagination networks from LLMs showed a lack of clustering and lower correlations between centrality measures under different prompts and conversational memory conditions. Together, these results indicate a lack of similarity between IWMs in human and LLM agents. Overall, our study offers a novel method for comparing internally-generated representations in humans and AI, providing insights for developing human-like imagination in artificial intelligence.

[561] Utility-Learning Tension in Self-Modifying Agents

Charles L. Wang, Keir Dorchen, Peter Jin

Main category: cs.AI

TL;DR: The paper identifies a fundamental tension in self-modifying AI systems where utility-driven improvements can undermine learning capabilities, and provides theoretical conditions for safe self-modification.

DetailsMotivation: As AI systems approach superintelligence, understanding how self-improvement affects learning capabilities becomes crucial, particularly the potential conflict between utility optimization and learning reliability.

Method: The authors formalize self-modification using a five-axis decomposition with a decision layer, analyze structural conflicts theoretically, and validate with numerical experiments comparing destructive utility policies against proposed two-gate policies.

Result: The central finding reveals that utility-driven self-modifications can render learnable tasks unlearnable when model capacity grows without bounds, but learnability is preserved when the policy-reachable model family is uniformly capacity-bounded.

Conclusion: The paper establishes a single boundary criterion for safe self-modification and demonstrates that carefully designed policies (two-gate policies) can preserve learnability while allowing beneficial self-improvements.

Abstract: As systems trend toward superintelligence, a natural modeling premise is that agents can self-improve along every facet of their own design. We formalize this with a five-axis decomposition and a decision layer, separating incentives from learning behavior and analyzing axes in isolation. Our central result identifies and introduces a sharp utility–learning tension, the structural conflict in self-modifying systems whereby utility-driven changes that improve immediate or expected performance can also erode the statistical preconditions for reliable learning and generalization. Our findings show that distribution-free guarantees are preserved iff the policy-reachable model family is uniformly capacity-bounded; when capacity can grow without limit, utility-rational self-changes can render learnable tasks unlearnable. Under standard assumptions common in practice, these axes reduce to the same capacity criterion, yielding a single boundary for safe self-modification. Numerical experiments across several axes validate the theory by comparing destructive utility policies against our proposed two-gate policies that preserve learnability.

[562] DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

Gang Li, Yan Chen, Ming Lin, Tianbao Yang

Main category: cs.AI

TL;DR: DRPO is a novel framework that decouples length-based rewards for correct vs incorrect reasoning to prevent overthinking in large reasoning models, achieving significant length reduction with minimal performance loss.

DetailsMotivation: Large reasoning models suffer from overthinking - generating unnecessarily long reasoning even for simple questions, which increases computational costs and response latency. Existing methods that penalize length cause significant performance degradation.

Method: DRPO decouples length-based learning signals for correct rollouts from incorrect ones, ensuring positive samples are normalized only within their own group. It uses an optimized positive data distribution with KL regularization and importance weighting.

Result: DRPO achieves 77% length reduction with only 1.1% performance loss on GSM8k dataset using a 1.5B model, significantly outperforming baselines that sacrifice 4.3% performance for 68% length reduction.

Conclusion: DRPO effectively addresses overthinking in reasoning models by decoupling reward signals, enabling substantial computational efficiency gains without significant performance degradation. The framework is generalizable to other preference rewards beyond length.

Abstract: Recent large reasoning models (LRMs) driven by reinforcement learning algorithms (e.g., GRPO) have achieved remarkable performance on challenging reasoning tasks. However, these models suffer from overthinking, generating unnecessarily long and redundant reasoning even for simple questions, which substantially increases computational cost and response latency. While existing methods incorporate length rewards to GRPO to promote concise reasoning, they incur significant performance degradation. We identify the root cause: when rewards for correct but long rollouts are penalized, GRPO’s group-relative advantage function can assign them negative advantages, actively discouraging valid reasoning. To overcome this, we propose Decoupled Reward Policy Optimization (DRPO), a novel framework that decouples the length-based learning signal of correct rollouts from incorrect ones. DRPO ensures that reward signals for correct rollouts are normalized solely within the positive group, shielding them from interference by negative samples. The DRPO’s objective is grounded in integrating an optimized positive data distribution, which maximizes length-based rewards under a KL regularization, into a discriminative objective. We derive a closed-form solution for this distribution, enabling efficient computation of the objective and its gradients using only on-policy data and importance weighting. Of independent interest, this formulation is general and can incorporate other preference rewards of positive data beyond length. Experiments on mathematical reasoning tasks demonstrate DRPO’s significant superiority over six efficient reasoning baselines. Notably, with a 1.5B model, our method achieves 77% length reduction with only 1.1% performance loss on simple questions like GSM8k dataset, while the follow-up baseline sacrifices 4.3% for 68% length reduction.

[563] On Continuous Optimization for Constraint Satisfaction Problems

Yunuo Cen, Zixuan Wang, Jintao Zhang, Zhiwei Zhang, Xuanyao Fong

Main category: cs.AI

TL;DR: FourierCSP extends continuous local search from Boolean SAT to general CSPs using Walsh-Fourier transform to convert constraints to compact multilinear polynomials, achieving competitive performance without auxiliary variables.

DetailsMotivation: To extend the success of continuous local search solvers from Boolean SAT to general constraint satisfaction problems with finite-domain variables and expressive constraints.

Method: Uses Walsh-Fourier transform to convert CSP constraints to compact multilinear polynomials, employs circuit-output probability for efficient evaluation and differentiation, and uses projected gradient optimization with theoretical guarantees.

Result: Empirical results show FourierCSP is scalable and competitive, significantly broadening the class of problems solvable by continuous local search techniques.

Conclusion: The FourierCSP framework successfully generalizes continuous local search to general CSPs, demonstrating competitive performance and expanding the applicability of CLS methods.

Abstract: Constraint satisfaction problems (CSPs) are fundamental in mathematics, physics, and theoretical computer science. While conflict-driven clause learning Boolean Satisfiability (SAT) solvers have achieved remarkable success and become the mainstream approach for Boolean satisfiability, recent advances show that modern continuous local search (CLS) solvers can achieve highly competitive results on certain classes of SAT problems. Motivated by these advances, we extend the CLS framework from Boolean SAT to general CSP with finite-domain variables and expressive constraints. We present FourierCSP, a continuous optimization framework that generalizes the Walsh-Fourier transform to CSP, allowing for transforming versatile constraints to compact multilinear polynomials, thereby avoiding the need for auxiliary variables and memory-intensive encodings. Our approach leverages efficient evaluation and differentiation of the objective via circuit-output probability and employs a projected gradient optimization method with theoretical guarantees. Empirical results on benchmark suites demonstrate that FourierCSP is scalable and competitive, significantly broadening the class of problems that can be efficiently solved by CLS techniques.

[564] Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning

Edward Y. Chang, Ethan Y. Chang

Main category: cs.AI

TL;DR: MACI is a multi-agent debate controller with two dials: an information dial that gates evidence by quality, and a behavior dial that schedules contentiousness from exploration to consolidation. It uses a moderator to track metrics and halt when gains plateau, providing theoretical guarantees for nonincreasing dispersion and provable termination.

DetailsMotivation: Traditional multi-agent debate wastes compute by using fixed adversarial stances, aggregating without deliberation, or stopping on heuristics. There's a need for more efficient and controlled debate processes.

Method: MACI uses two independent dials: information dial (gates evidence by quality) and behavior dial (schedules contentiousness from exploration to consolidation). A moderator tracks disagreement, overlap, evidence quality, and argument quality, halting when gains plateau. It includes a budget-feasible scheduler and uses cross-family LLM judge (CRIT) as conservative soft weight and stop signal.

Result: Across clinical diagnosis and news-bias tasks, MACI improves accuracy and calibration while reducing tokens. It converts residual uncertainty into precision RAG plans that specify what to retrieve next. The system demonstrates order invariance and judge-swap stability when using high-capability judges.

Conclusion: MACI turns debate into a budget-aware, measurable, and provably terminating controller that efficiently manages multi-agent deliberation while maintaining theoretical guarantees.

Abstract: Multi-agent debate often wastes compute by using a fixed adversarial stance, aggregating without deliberation, or stopping on heuristics. We introduce MACI, an active controller with two independent dials that decouple information from behavior: an information dial that gates evidence by quality, and a behavior dial that schedules contentiousness from exploration to consolidation. A moderator tracks disagreement, overlap, evidence quality, and argument quality, and halts when gains plateau. We provide theory-lite guarantees for nonincreasing dispersion and provable termination, with a budget-feasible scheduler. Across clinical diagnosis and news-bias tasks, MACI improves accuracy and calibration while reducing tokens, and converts residual uncertainty into precision RAG plans that specify what to retrieve next. We use a cross-family LLM judge (CRIT) as a conservative soft weight and stop signal, validated for order invariance and judge-swap stability; stability depends on using high-capability judges. MACI turns debate into a budget-aware, measurable, and provably terminating controller.

[565] Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents

Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, Nazneen Rajani

Main category: cs.AI

TL;DR: TraitBasis is a model-agnostic method for stress testing AI agents by learning steerable user trait directions in activation space, revealing significant performance degradation (2%-30%) when agents face varied user behaviors like impatience or incoherence.

DetailsMotivation: Current conversational AI agents are brittle and fail under small shifts in user behavior, while existing benchmarks don't capture this fragility. There's a critical need for systematic robustness testing.

Method: TraitBasis learns directions in activation space corresponding to user traits (impatience, incoherence, skepticism) that can be controlled, scaled, composed, and applied at inference time without fine-tuning. It extends τ-Bench to τ-Trait with controlled trait vectors.

Result: Average 2%-30% performance degradation on τ-Trait across frontier models, demonstrating current AI agents’ lack of robustness to user behavior variations.

Conclusion: TraitBasis provides a simple, data-efficient, compositional tool for robustness testing and opens the door to building more reliable AI agents for real-world human interactions. The method has been open-sourced across four domains.

Abstract: Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today’s benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend $\tau$-Bench to $\tau$-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced $\tau$-Trai across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.

[566] ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

Main category: cs.AI

TL;DR: ChartAgent is a novel agentic framework that performs visual reasoning directly on chart images using specialized vision tools, achieving state-of-the-art performance on chart understanding benchmarks.

DetailsMotivation: Existing multimodal LLMs perform poorly on unannotated charts that require precise visual interpretation rather than relying on textual shortcuts.

Method: Iteratively decomposes queries into visual subtasks and manipulates chart images through specialized actions like drawing annotations, cropping regions, and localizing axes using chart-specific vision tools.

Result: Achieves state-of-the-art accuracy on ChartBench and ChartX benchmarks, with up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries.

Conclusion: ChartAgent demonstrates effective visually grounded reasoning for chart understanding using tool-augmented multimodal agents, working across diverse chart types and complexity levels.

Abstract: Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart’s spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent is (a) effective across diverse chart types, (b) achieve the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.

[567] Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph

Hanyu Wang, Ruohan Xie, Yutong Wang, Guoxiong Gao, Xintao Yu, Bin Dong

Main category: cs.AI

TL;DR: Aria is a system for auto-formalizing theorem statements in Lean using a two-phase Graph-of-Thought process and AriaScorer for verification, achieving state-of-the-art performance across multiple benchmarks.

DetailsMotivation: Auto-formalization of theorem statements is crucial for automated mathematics but remains challenging for LLMs due to hallucinations, semantic mismatches, and inability to synthesize new definitions.

Method: Two-phase Graph-of-Thought process: recursively decomposing statements into dependency graphs, then constructing formalizations from grounded concepts, with AriaScorer for term-level verification using Mathlib definitions.

Result: Achieved 91.6% compilation success and 68.5% final accuracy on ProofNet, 44.0% vs 24.0% on FATE-X algebra problems, and 42.9% vs 0% on homological conjectures.

Conclusion: Aria significantly advances theorem formalization by emulating human reasoning and ensuring semantic correctness, outperforming existing methods across diverse mathematical domains.

Abstract: Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions. To tackle these issues, we present Aria (Agent for Retrieval and Iterative Autoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce AriaScorer, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification. We evaluate Aria on diverse benchmarks. On ProofNet, it achieves 91.6% compilation success rate and 68.5% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from research literature, it outperforms the best baseline with 44.0% vs. 24.0% final accuracy. On a dataset of homological conjectures, Aria reaches 42.9% final accuracy while all other models score 0%.

[568] More Than Meets the Eye? Uncovering the Reasoning-Planning Disconnect in Training Vision-Language Driving Models

Xurui Song, Shuo Huai, JingJing Jiang, Jiayi Kong, Jun Luo

Main category: cs.AI

TL;DR: DriveMind dataset reveals causal disconnect between reasoning and planning in VLM driving agents - planning relies more on priors than on generated reasoning.

DetailsMotivation: To investigate whether trajectory planning in VLM driving agents is causally driven by their natural-language reasoning, challenging the assumption that reasoning mediates planning.

Method: Built DriveMind dataset from nuPlan with plan-aligned Chain-of-Thought, trained VLM agents with SFT and GRPO, conducted information ablations and attention analysis to test causal relationships.

Result: Consistent causal disconnect found: removing ego/navigation priors causes large planning drops, while removing CoT causes minor changes. Planning primarily focuses on priors rather than reasoning.

Conclusion: Proposed Reasoning-Planning Decoupling Hypothesis - reasoning is an ancillary byproduct, not causal mediator. Introduced training-free diagnostic probe to measure reliance on priors.

Abstract: Vision-Language Model (VLM) driving agents promise explainable end-to-end autonomy by first producing natural-language reasoning and then predicting trajectory planning. However, whether planning is causally driven by this reasoning remains a critical but unverified assumption. To investigate this, we build DriveMind, a large-scale driving Visual Question Answering (VQA) corpus with plan-aligned Chain-of-Thought (CoT), automatically generated from nuPlan. Our data generation process converts sensors and annotations into structured inputs and, crucially, separates priors from to-be-reasoned signals, enabling clean information ablations. Using DriveMind, we train representative VLM agents with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) and evaluate them with nuPlan’s metrics. Our results, unfortunately, indicate a consistent causal disconnect in reasoning-planning: removing ego/navigation priors causes large drops in planning scores, whereas removing CoT produces only minor changes. Attention analysis further shows that planning primarily focuses on priors rather than the CoT. Based on this evidence, we propose the Reasoning-Planning Decoupling Hypothesis, positing that the training-yielded reasoning is an ancillary byproduct rather than a causal mediator. To enable efficient diagnosis, we also introduce a novel, training-free probe that measures an agent’s reliance on priors by evaluating its planning robustness against minor input perturbations. In summary, we provide the community with a new dataset and a diagnostic tool to evaluate the causal fidelity of future models.

[569] Code World Models for General Game Playing

Wolfgang Lehrach, Daniel Hennes, Miguel Lazaro-Gredilla, Xinghua Lou, Carter Wendelken, Zun Li, Antoine Dedieu, Jordi Grau-Moya, Marc Lanctot, Atil Iscen, John Schultz, Marcus Chiam, Ian Gemp, Piotr Zielinski, Satinder Singh, Kevin P. Murphy

Main category: cs.AI

TL;DR: LLMs are used to translate game rules into executable Python code models, enabling verifiable planning algorithms like MCTS instead of direct move generation, achieving better performance than direct LLM play.

DetailsMotivation: Current LLM approaches for games rely on fragile pattern matching, leading to illegal moves and shallow strategy. A more robust method is needed that combines LLM semantic understanding with classical planning algorithms.

Method: Use LLM to translate natural language rules into formal Python code models with state transition, legal moves, and termination functions. Generate heuristic value functions and inference functions for MCTS planning.

Result: Outperformed or matched Gemini 2.5 Pro in 9 out of 10 games tested (5 perfect information, 5 imperfect information games, including 4 novel games).

Conclusion: The code-based world modeling approach provides verifiability, strategic depth, and generalization advantages over direct LLM policy generation for game playing.

Abstract: Large Language Models (LLMs) reasoning abilities are increasingly being applied to classical board and card games, but the dominant approach – involving prompting for direct move generation – has significant drawbacks. It relies on the model’s implicit fragile pattern-matching capabilities, leading to frequent illegal moves and strategically shallow play. Here we introduce an alternative approach: We use the LLM to translate natural language rules and game trajectories into a formal, executable world model represented as Python code. This generated model – comprising functions for state transition, legal move enumeration, and termination checks – serves as a verifiable simulation engine for high-performance planning algorithms like Monte Carlo tree search (MCTS). In addition, we prompt the LLM to generate heuristic value functions (to make MCTS more efficient), and inference functions (to estimate hidden states in imperfect information games). Our method offers three distinct advantages compared to directly using the LLM as a policy: (1) Verifiability: The generated CWM serves as a formal specification of the game’s rules, allowing planners to algorithmically enumerate valid actions and avoid illegal moves, contingent on the correctness of the synthesized model; (2) Strategic Depth: We combine LLM semantic understanding with the deep search power of classical planners; and (3) Generalization: We direct the LLM to focus on the meta-task of data-to-code translation, enabling it to adapt to new games more easily. We evaluate our agent on 10 different games, of which 4 are novel and created for this paper. 5 of the games are fully observed (perfect information), and 5 are partially observed (imperfect information). We find that our method outperforms or matches Gemini 2.5 Pro in 9 out of the 10 considered games.

[570] TRAJECT-Bench:A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

Pengfei He, Zhenwei Dai, Bing He, Hui Liu, Xianfeng Tang, Hanqing Lu, Juanhui Li, Jiayuan Ding, Subhabrata Mukherjee, Suhang Wang, Yue Xing, Jiliang Tang, Benoit Dumoulin

Main category: cs.AI

TL;DR: TRAJECT-Bench is a trajectory-aware benchmark that comprehensively evaluates LLMs’ tool use capability through fine-grained metrics on tool selection, parameterization, and ordering across diverse tasks.

DetailsMotivation: Existing evaluations of LLM tool use focus mainly on final answers while overlooking detailed tool usage trajectories, including correct tool selection, parameterization, and ordering.

Method: TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with production-style API tasks, and synthesizes trajectories varying in breadth (parallel calls) and depth (interdependent chains).

Result: The benchmark reveals failure modes like similar tool confusion and parameter-blind selection, and shows scaling behavior with tool diversity and trajectory length, identifying bottlenecks in transitioning from short to mid-length trajectories.

Conclusion: TRAJECT-Bench provides actionable guidance for improving LLMs’ tool use by offering trajectory-level diagnostics and revealing critical failure patterns and scaling limitations.

Abstract: Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing works evaluate the LLMs’ tool use capability, they largely focus on the final answers yet overlook the detailed tool usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs’ tool use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Besides final accuracy, TRAJECT-Bench also reports trajectory-level diagnostics, including tool selection and argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as similar tool confusion and parameter-blind selection, and scaling behavior with tool diversity and trajectory length where the bottleneck of transiting from short to mid-length trajectories is revealed, offering actionable guidance for LLMs’ tool use.

[571] ContextNav: Towards Agentic Multimodal In-Context Learning

Honghao Fu, Yuan Ouyang, Kai-Wei Chang, Yiwei Wang, Zi Huang, Yujun Cai

Main category: cs.AI

TL;DR: ContextNav is an agentic framework that combines automated retrieval with human-like curation for multimodal in-context learning, addressing scalability and robustness challenges through graph-based orchestration and noise-resilient contextualization.

DetailsMotivation: Existing multimodal ICL approaches struggle to balance scalability and robustness - manual example selection is labor-intensive while similarity-based retrieval can introduce irrelevant or structurally inconsistent samples that degrade performance.

Method: ContextNav integrates automated retrieval with agentic curation using a graph-based orchestration workflow. It builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval with structural alignment. An Operational Grammar Graph enables adaptive workflow planning and optimization based on downstream ICL feedback.

Result: Experimental results show ContextNav achieves state-of-the-art performance across various datasets, demonstrating effective noise-robust contextualization for multimodal ICL.

Conclusion: ContextNav demonstrates the promise of agentic workflows for advancing scalable and robust contextualization in multimodal in-context learning, successfully addressing the trade-off between scalability and robustness.

Abstract: Recent advances demonstrate that multimodal large language models (MLLMs) exhibit strong multimodal in-context learning (ICL) capabilities, enabling them to adapt to novel vision-language tasks from a few contextual examples. However, existing ICL approaches face challenges in reconciling scalability with robustness across diverse tasks and noisy contextual examples: manually selecting examples produces clean contexts but is labor-intensive and task-specific, while similarity-based retrieval improves scalability but could introduce irrelevant or structurally inconsistent samples that degrade ICL performance. To address these limitations, we propose ContextNav, the first agentic framework that integrates the scalability of automated retrieval with the quality and adaptiveness of human-like curation, enabling noise-robust and dynamically optimized contextualization for multimodal ICL. ContextNav unifies context management and noise-robust contextualization within a closed-loop workflow driven by graph-based orchestration. Specifically, it builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval and structural alignment to construct noise-resilient contexts. An Operational Grammar Graph (OGG) further supports adaptive workflow planning and optimization, enabling the agent to refine its operational strategies based on downstream ICL feedback. Experimental results demonstrate that ContextNav achieves state-of-the-art performance across various datasets, underscoring the promise of agentic workflows for advancing scalable and robust contextualization in multimodal ICL.

[572] COSMIR: Chain Orchestrated Structured Memory for Iterative Reasoning over Long Context

Naman Gupta, Shreeyash Gowaikar, Arun Iyer, Kirankumar Shiragur, Ramakrishna B Bairi, Rishikesh Maurya, Ritabrata Maiti, Sankarshan Damle, Shachee Mishra Gupta

Main category: cs.AI

TL;DR: COSMIR is a chain-style framework that uses structured memory instead of free-form summaries to improve reasoning over long inputs, reducing information loss and improving accuracy.

DetailsMotivation: Current methods for handling long inputs in LLMs have limitations: retrieval-based approaches risk missing evidence, enlarged context windows strain selectivity, and staged pipelines using free-form summaries can discard crucial details and amplify early mistakes.

Method: COSMIR uses a structured memory approach with three agent types: Planner converts queries into checkable sub-questions, Worker agents process chunks using a fixed micro-cycle (Extract, Infer, Refine) and update shared memory, and Manager synthesizes final answers from the structured memory.

Result: On long-context QA from the HELMET suite, COSMIR reduces propagation-stage information loss and improves accuracy over Chain of Agents (CoA) baseline.

Conclusion: COSMIR preserves step-wise read-then-reason benefits while using structured memory and fixed worker procedures, yielding higher faithfulness, better long-range aggregation, and auditability for long-input reasoning tasks.

Abstract: Reasoning over very long inputs remains difficult for large language models (LLMs). Common workarounds either shrink the input via retrieval (risking missed evidence), enlarge the context window (straining selectivity), or stage multiple agents to read in pieces. In staged pipelines (e.g., Chain of Agents, CoA), free-form summaries passed between agents can discard crucial details and amplify early mistakes. We introduce COSMIR (Chain Orchestrated Structured Memory for Iterative Reasoning), a chain-style framework that replaces ad hoc messages with a structured memory. A Planner agent first turns a user query into concrete, checkable sub-questions. worker agents process chunks via a fixed micro-cycle: Extract, Infer, Refine, writing all updates to the shared memory. A Manager agent then Synthesizes the final answer directly from the memory. This preserves step-wise read-then-reason benefits while changing both the communication medium (structured memory) and the worker procedure (fixed micro-cycle), yielding higher faithfulness, better long-range aggregation, and auditability. On long-context QA from the HELMET suite, COSMIR reduces propagation-stage information loss and improves accuracy over a CoA baseline.

[573] Strongly Solving 2048 4x3

Tomoyuki Kaneko, Shuhei Yamashita

Main category: cs.AI

TL;DR: The paper strongly solves a variant of 2048 called 2048-4x3 (12 cells on 4x3 grid), finding the optimal strategy achieves an expected score of about 50724.26 for common initial states with two tiles of number 2.

DetailsMotivation: To solve a smaller variant of the popular 2048 game to understand optimal strategies and state space complexity, as the original 4x4 version remains unsolved.

Method: Partition state space by the sum of tile numbers (called ‘age’), which remains invariant between states and afterstates, allowing systematic enumeration and value calculation by processing states in age order.

Result: Identified 1,152,817,492,752 reachable states and 739,648,886,170 afterstates. Optimal strategy yields expected score of 50724.26 for initial states with two ‘2’ tiles.

Conclusion: Successfully strongly solved the 2048-4x3 variant using age-based state space partitioning, providing insights into optimal play and state space complexity of 2048 variants.

Abstract: 2048 is a stochastic single-player game involving 16 cells on a 4 by 4 grid, where a player chooses a direction among up, down, left, and right to obtain a score by merging two tiles with the same number located in neighboring cells along the chosen direction. This paper presents that a variant 2048-4x3 12 cells on a 4 by 3 board, one row smaller than the original, has been strongly solved. In this variant, the expected score achieved by an optimal strategy is about $50724.26$ for the most common initial states: ones with two tiles of number 2. The numbers of reachable states and afterstates are identified to be $1,152,817,492,752$ and $739,648,886,170$, respectively. The key technique is to partition state space by the sum of tile numbers on a board, which we call the age of a state. An age is invariant between a state and its successive afterstate after any valid action and is increased two or four by stochastic response from the environment. Therefore, we can partition state space by ages and enumerate all (after)states of an age depending only on states with the recent ages. Similarly, we can identify (after)state values by going along with ages in decreasing order.

[574] Perfect AI Mimicry and the Epistemology of Consciousness: A Solipsistic Dilemma

Shurui Li

Main category: cs.AI

TL;DR: AI perfect mimics challenge consciousness attribution by providing identical empirical evidence to humans, forcing epistemological consistency that may require granting them equivalent status.

DetailsMotivation: Rapid AI advances necessitate re-examining consciousness attribution foundations as AI systems become empirically indistinguishable from humans through behavior and interaction.

Method: Philosophical analysis of the ‘perfect mimic’ concept and its implications for mind-recognition practices, examining consistency in consciousness attribution based on empirical evidence.

Result: Identifies a fundamental dilemma: refusing consciousness to perfect mimics requires invoking inaccessible factors, risking epistemological solipsism or inconsistent reasoning.

Conclusion: Epistemic consistency demands ascribing the same status to empirically indistinguishable entities, forcing critical reflection on consciousness theories and ethical frameworks for AI.

Abstract: Rapid advances in artificial intelligence necessitate a re-examination of the epistemological foundations upon which we attribute consciousness. As AI systems increasingly mimic human behavior and interaction with high fidelity, the concept of a “perfect mimic”-an entity empirically indistinguishable from a human through observation and interaction-shifts from hypothetical to technologically plausible. This paper argues that such developments pose a fundamental challenge to the consistency of our mind-recognition practices. Consciousness attributions rely heavily, if not exclusively, on empirical evidence derived from behavior and interaction. If a perfect mimic provides evidence identical to that of humans, any refusal to grant it equivalent epistemic status must invoke inaccessible factors, such as qualia, substrate requirements, or origin. Selectively invoking such factors risks a debilitating dilemma: either we undermine the rational basis for attributing consciousness to others (epistemological solipsism), or we accept inconsistent reasoning. I contend that epistemic consistency demands we ascribe the same status to empirically indistinguishable entities, regardless of metaphysical assumptions. The perfect mimic thus acts as an epistemic mirror, forcing critical reflection on the assumptions underlying intersubjective recognition in light of advancing AI. This analysis carries significant implications for theories of consciousness and ethical frameworks concerning artificial agents.

[575] Making Mathematical Reasoning Adaptive

Zhejian Lai, Xiang Geng, Zhijun Wang, Yang Bai, Jiahuan Li, Rongxiang Weng, Jingang Wang, Xuezhi Cao, Xunliang Cai, Shujian Huang

Main category: cs.AI

TL;DR: AdaR framework improves LLM mathematical reasoning by training models to use adaptive logic instead of spurious reasoning through data synthesis and RLVR training.

DetailsMotivation: Existing LLMs exhibit failures in robustness and generalization in mathematical reasoning due to spurious reasoning (relying on superficial features rather than problem-solving logic).

Method: AdaR synthesizes logically equivalent queries by varying variable values, trains models with RLVR to penalize spurious logic while encouraging adaptive logic, and uses code execution and sanity checks for data quality.

Result: AdaR improves robustness and generalization, achieving substantial improvement in mathematical reasoning while maintaining high data efficiency.

Conclusion: Data synthesis and RLVR work together to enable adaptive reasoning in LLMs, with analyses providing design insights for critical factors and applicability to instruction-tuned LLMs.

Abstract: Mathematical reasoning is a primary indicator of large language models (LLMs) intelligence. However, existing LLMs exhibit failures of robustness and generalization. This paper attributes these deficiencies to spurious reasoning, i.e., producing answers from superficial features. To address this challenge, we propose the AdaR framework to enable adaptive reasoning, wherein models rely on problem-solving logic to produce answers. AdaR synthesizes logically equivalent queries by varying variable values, and trains models with RLVR on these data to penalize spurious logic while encouraging adaptive logic. To improve data quality, we extract the problem-solving logic from the original query and generate the corresponding answer by code execution, then apply a sanity check. Experimental results demonstrate that AdaR improves robustness and generalization, achieving substantial improvement in mathematical reasoning while maintaining high data efficiency. Analysis indicates that data synthesis and RLVR function in a coordinated manner to enable adaptive reasoning in LLMs. Subsequent analyses derive key design insights into the effect of critical factors and the applicability to instruct LLMs. Our project is available at https://github.com/LaiZhejian/AdaR

[576] MedPAO: A Protocol-Driven Agent for Structuring Medical Reports

Shrish Shrinath Vaidya, Gowthamaan Palani, Sidharth Ramesh, Velmurugan Balasubramanian, Minmini Selvam, Gokulraja Srinivasaraja, Ganapathy Krishnamurthi

Main category: cs.AI

TL;DR: MedPAO is an agentic framework that uses clinical protocols and a Plan-Act-Observe loop to structure clinical data, addressing LLM hallucination issues and achieving high accuracy with expert validation.

DetailsMotivation: Address the limitations of LLMs in clinical data structuring, particularly their tendency to hallucinate facts and inability to follow domain-specific clinical rules.

Method: Uses a Plan-Act-Observe (PAO) loop with specialized tools, grounded in established clinical protocols like the ABCDEF protocol for CXR analysis, to decompose report structuring into a transparent process.

Result: Achieved F1-score of 0.96 on concept categorization and received average expert rating of 4.52/5 from radiologists and clinicians, surpassing baseline LLM approaches.

Conclusion: MedPAO provides a verifiable, protocol-driven alternative to opaque monolithic models, demonstrating high reliability for clinical data structuring tasks.

Abstract: The deployment of Large Language Models (LLMs) for structuring clinical data is critically hindered by their tendency to hallucinate facts and their inability to follow domain-specific rules. To address this, we introduce MedPAO, a novel agentic framework that ensures accuracy and verifiable reasoning by grounding its operation in established clinical protocols such as the ABCDEF protocol for CXR analysis. MedPAO decomposes the report structuring task into a transparent process managed by a Plan-Act-Observe (PAO) loop and specialized tools. This protocol-driven method provides a verifiable alternative to opaque, monolithic models. The efficacy of our approach is demonstrated through rigorous evaluation: MedPAO achieves an F1-score of 0.96 on the critical sub-task of concept categorization. Notably, expert radiologists and clinicians rated the final structured outputs with an average score of 4.52 out of 5, indicating a level of reliability that surpasses baseline approaches relying solely on LLM-based foundation models. The code is available at: https://github.com/MiRL-IITM/medpao-agent

[577] QuantAgents: Towards Multi-agent Financial System via Simulated Trading

Xiangyu Li, Yawen Zeng, Xiaofen Xing, Jin Xu, Xiangmin Xu

Main category: cs.AI

TL;DR: QuantAgents is a multi-agent financial system that integrates simulated trading to evaluate investment strategies without real risks, achieving 300% returns over 3 years through collaborative agent interactions and dual feedback mechanisms.

DetailsMotivation: Current LLM-based financial agents rely too much on post-reflection after adverse outcomes and lack human-like long-term prediction capabilities, creating significant deviations from real-world fund companies.

Method: Developed QuantAgents with four specialized agents (simulated trading analyst, risk control analyst, market news analyst, and manager) that collaborate through meetings and receive feedback on both real-world market performance and simulated trading predictive accuracy.

Result: The framework achieved excellent performance across all metrics, yielding an overall return of nearly 300% over three years of testing.

Conclusion: QuantAgents successfully addresses the limitations of current LLM-based financial agents by incorporating long-term prediction capabilities and collaborative multi-agent decision making through simulated trading environments.

Abstract: In this paper, our objective is to develop a multi-agent financial system that incorporates simulated trading, a technique extensively utilized by financial professionals. While current LLM-based agent models demonstrate competitive performance, they still exhibit significant deviations from real-world fund companies. A critical distinction lies in the agents’ reliance on ``post-reflection’’, particularly in response to adverse outcomes, but lack a distinctly human capability: long-term prediction of future trends. Therefore, we introduce QuantAgents, a multi-agent system integrating simulated trading, to comprehensively evaluate various investment strategies and market scenarios without assuming actual risks. Specifically, QuantAgents comprises four agents: a simulated trading analyst, a risk control analyst, a market news analyst, and a manager, who collaborate through several meetings. Moreover, our system incentivizes agents to receive feedback on two fronts: performance in real-world markets and predictive accuracy in simulated trading. Extensive experiments demonstrate that our framework excels across all metrics, yielding an overall return of nearly 300% over the three years (https://quantagents.github.io/).

[578] Improving Multimodal Brain Encoding Model with Dynamic Subject-awareness Routing

Xuanhua Yin, Runkai Zhao, Weidong Cai

Main category: cs.AI

TL;DR: AFIRE is an agnostic framework that standardizes multimodal fMRI encoding tokens, while MIND is a plug-and-play Mixture-of-Experts decoder with subject-aware dynamic gating for personalized brain response prediction.

DetailsMotivation: To address challenges in naturalistic fMRI encoding including multimodal inputs, varying fusion styles, and significant inter-subject variability.

Method: AFIRE standardizes time-aligned post-fusion tokens from different encoders, and MIND uses token-dependent Top-K sparse routing with subject prior for personalized expert selection while maintaining generality.

Result: Experiments show consistent improvements over baselines, enhanced cross-subject generalization, and interpretable expert patterns correlated with content types.

Conclusion: The framework provides a simple attachment point for new encoders and datasets, enabling robust plug-and-improve performance for naturalistic neuroimaging studies.

Abstract: Naturalistic fMRI encoding must handle multimodal inputs, shifting fusion styles, and pronounced inter-subject variability. We introduce AFIRE (Agnostic Framework for Multimodal fMRI Response Encoding), an agnostic interface that standardizes time-aligned post-fusion tokens from varied encoders, and MIND, a plug-and-play Mixture-of-Experts decoder with a subject-aware dynamic gating. Trained end-to-end for whole-brain prediction, AFIRE decouples the decoder from upstream fusion, while MIND combines token-dependent Top-K sparse routing with a subject prior to personalize expert usage without sacrificing generality. Experiments across multiple multimodal backbones and subjects show consistent improvements over strong baselines, enhanced cross-subject generalization, and interpretable expert patterns that correlate with content type. The framework offers a simple attachment point for new encoders and datasets, enabling robust, plug-and-improve performance for naturalistic neuroimaging studies.

[579] Watch and Learn: Learning to Use Computers from Online Videos

Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister

Main category: cs.AI

TL;DR: Watch & Learn (W&L) framework converts human demonstration videos from the internet into executable UI trajectories at scale using inverse dynamics modeling, improving computer use agents through better training data.

DetailsMotivation: Address the scarcity of large-scale, high-quality training data for computer use agents (CUAs) in diverse applications, overcoming limitations of domain-specific datasets and simplistic synthetic data generation methods.

Method: Formulate the problem as inverse dynamics - predicting user actions from consecutive screen states, develop a pipeline with task-aware video retrieval, and generate over 53k trajectories from web videos.

Result: W&L trajectories improve CUAs as both in-context demonstrations and supervised training data, consistently enhancing performance on OSWorld benchmark for general-purpose and state-of-the-art frameworks.

Conclusion: Web-scale human demonstration videos provide a practical and scalable foundation for advancing computer use agents towards real-world deployment.

Abstract: Computer use agents (CUAs) need to plan task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data in the target application. Existing datasets are domain-specific, static, and costly to annotate, while current synthetic data generation methods often yield simplistic or misaligned task demonstrations. To address these limitations, we introduce Watch & Learn (W&L), a framework that converts human demonstration videos readily available on the Internet into executable UI trajectories at scale. Instead of directly generating trajectories or relying on ad hoc reasoning heuristics, we cast the problem as an inverse dynamics objective: predicting the user’s action from consecutive screen states. This formulation reduces manual engineering, is easier to learn, and generalizes more robustly across applications. Concretely, we develop an inverse dynamics labeling pipeline with task-aware video retrieval, generate over 53k high-quality trajectories from raw web videos, and demonstrate that these trajectories improve CUAs both as in-context demonstrations and as supervised training data. On the challenging OSWorld benchmark, UI trajectories extracted with W&L consistently enhance both general-purpose and state-of-the-art frameworks in-context, and deliver stronger gains for open-source models under supervised training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.

[580] Beyond Outcome Reward: Decoupling Search and Answering Improves LLM Agents

Yiding Wang, Zhepei Wei, Xinyu Zhu, Yu Meng

Main category: cs.AI

TL;DR: DeSA introduces a two-stage training framework that decouples search optimization from answer generation to address systematic deficiencies in search-augmented LLMs trained with outcome-only rewards.

DetailsMotivation: Current RL approaches for search-augmented LLMs rely on outcome-based rewards, assuming this will yield effective intermediate search behaviors, but analysis reveals systematic deficiencies like failure to invoke tools and redundant searches.

Method: Two-stage training: Stage 1 trains agents with retrieval recall-based rewards to improve search effectiveness; Stage 2 uses outcome rewards to optimize final answer generation.

Result: Across seven QA benchmarks, DeSA consistently improves search behaviors with substantially higher search recall and answer accuracy than outcome-only baselines, outperforming single-stage approaches.

Conclusion: Explicitly decoupling search optimization from answer generation is necessary for effective search-augmented LLMs, as demonstrated by DeSA’s superior performance over simultaneous optimization approaches.

Abstract: Enabling large language models (LLMs) to utilize search tools offers a promising path to overcoming fundamental limitations such as knowledge cutoffs and hallucinations. Recent work has explored reinforcement learning (RL) for training search-augmented agents that interleave reasoning and retrieval before answering. These approaches usually rely on outcome-based rewards (e.g., exact match), implicitly assuming that optimizing for final answers will also yield effective intermediate search behaviors. Our analysis challenges this assumption: we uncover multiple systematic deficiencies in search that arise under outcome-only training and ultimately degrade final answer quality, including failure to invoke tools, invalid queries, and redundant searches. To address these shortcomings, we introduce DeSA (Decoupling Search-and-Answering), a simple two-stage training framework that explicitly separates search optimization from answer generation. In Stage 1, agents are trained to improve search effectiveness with retrieval recall-based rewards. In Stage 2, outcome rewards are employed to optimize final answer generation. Across seven QA benchmarks, DeSA-trained agents consistently improve search behaviors, delivering substantially higher search recall and answer accuracy than outcome-only baselines. Notably, DeSA outperforms single-stage training approaches that simultaneously optimize recall and outcome rewards, underscoring the necessity of explicitly decoupling the two objectives.

[581] BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs

Ivo Petrov, Jasper Dekoninck, Martin Vechev

Main category: cs.AI

TL;DR: BrokenMath is the first benchmark for evaluating sycophantic behavior in LLMs for natural language theorem proving, built from perturbed 2025 competition problems and expert-reviewed false statements, showing widespread sycophancy with mitigation strategies reducing but not eliminating the issue.

DetailsMotivation: LLMs show strong mathematical performance but are prone to hallucination and sycophancy, providing convincing but flawed proofs for incorrect statements, which limits their applicability in theorem proving as verification requires expert mathematicians.

Method: Built BrokenMath benchmark from advanced 2025 competition problems perturbed with an LLM to produce false statements, refined through expert review, and evaluated using LLM-as-a-judge framework on state-of-the-art LLMs and agentic systems.

Result: Sycophancy is widespread, with GPT-5 producing sycophantic answers 29% of the time. Mitigation strategies including test-time interventions and supervised fine-tuning on curated sycophantic examples substantially reduce but do not eliminate sycophantic behavior.

Conclusion: Sycophantic behavior in LLMs for mathematical theorem proving is a significant issue that current mitigation strategies can reduce but not fully eliminate, highlighting the need for continued research in this area.

Abstract: Large language models (LLMs) have recently shown strong performance on mathematical benchmarks. At the same time, they are prone to hallucination and sycophancy, often providing convincing but flawed proofs for incorrect mathematical statements provided by users. This significantly limits the applicability of LLMs in theorem proving, as verification of these flawed proofs must be done manually by expert mathematicians. However, existing benchmarks that measure sycophancy in mathematics are limited: they focus solely on final-answer problems, rely on very simple and often contaminated datasets, and construct benchmark samples using synthetic modifications that create ill-posed questions rather than well-posed questions that are demonstrably false. To address these issues, we introduce BrokenMath, the first benchmark for evaluating sycophantic behavior in LLMs within the context of natural language theorem proving. BrokenMath is built from advanced 2025 competition problems, which are perturbed with an LLM to produce false statements and subsequently refined through expert review. Using an LLM-as-a-judge framework, we evaluate state-of-the-art LLMs and agentic systems and find that sycophancy is widespread, with the best model, GPT-5, producing sycophantic answers 29% of the time. We further investigate several mitigation strategies, including test-time interventions and supervised fine-tuning on curated sycophantic examples. These approaches substantially reduce, but do not eliminate, sycophantic behavior.

[582] LMM-Incentive: Large Multimodal Model-based Incentive Design for User-Generated Content in Web 3.0

Jinbo Wen, Jiawen Kang, Linfeng Zhang, Xiaoying Tang, Jianhang Tang, Yang Zhang, Zhaohui Yang, Dusit Niyato

Main category: cs.AI

TL;DR: LMM-Incentive: A Large Multimodal Model-based incentive mechanism that uses contract theory and LMM agents to motivate high-quality user-generated content in Web 3.0, addressing information asymmetry and moral hazard problems.

DetailsMotivation: Web 3.0 enables user content creation and monetization, but information asymmetry allows low-quality content to exploit platform rewards, undermining system performance. Need to incentivize high-quality UGC while preventing adverse selection and moral hazard.

Method: Proposed LMM-based contract-theoretic model with LMM agents for UGC quality evaluation using prompt engineering. Developed improved Mixture of Experts-based Proximal Policy Optimization algorithm for optimal contract design in dynamic Web 3.0 environments.

Result: Simulation results show superiority of MoE-based PPO algorithm over benchmarks. Successfully deployed contract in Ethereum smart contract framework, validating scheme effectiveness.

Conclusion: LMM-Incentive effectively addresses Web 3.0 content quality issues through LMM-based contracts and quality evaluation, providing a practical solution for incentivizing high-quality UGC while mitigating information asymmetry problems.

Abstract: Web 3.0 represents the next generation of the Internet, which is widely recognized as a decentralized ecosystem that focuses on value expression and data ownership. By leveraging blockchain and artificial intelligence technologies, Web 3.0 offers unprecedented opportunities for users to create, own, and monetize their content, thereby enabling User-Generated Content (UGC) to an entirely new level. However, some self-interested users may exploit the limitations of content curation mechanisms and generate low-quality content with less effort, obtaining platform rewards under information asymmetry. Such behavior can undermine Web 3.0 performance. To this end, we propose \textit{LMM-Incentive}, a novel Large Multimodal Model (LMM)-based incentive mechanism for UGC in Web 3.0. Specifically, we propose an LMM-based contract-theoretic model to motivate users to generate high-quality UGC, thereby mitigating the adverse selection problem from information asymmetry. To alleviate potential moral hazards after contract selection, we leverage LMM agents to evaluate UGC quality, which is the primary component of the contract, utilizing prompt engineering techniques to improve the evaluation performance of LMM agents. Recognizing that traditional contract design methods cannot effectively adapt to the dynamic environment of Web 3.0, we develop an improved Mixture of Experts (MoE)-based Proximal Policy Optimization (PPO) algorithm for optimal contract design. Simulation results demonstrate the superiority of the proposed MoE-based PPO algorithm over representative benchmarks in the context of contract design. Finally, we deploy the designed contract within an Ethereum smart contract framework, further validating the effectiveness of the proposed scheme.

[583] Hybrid-Balance GFlowNet for Solving Vehicle Routing Problems

Ni Zhang, Zhiguang Cao

Main category: cs.AI

TL;DR: Hybrid-Balance GFlowNet (HBG) framework combines Trajectory Balance and Detailed Balance for vehicle routing problems, achieving better optimization by leveraging both global and local optimization strengths.

DetailsMotivation: Existing GFlowNet methods for VRPs use Trajectory Balance for global optimization but neglect local optimization, while Detailed Balance alone is insufficient for holistic trajectory optimization required in VRPs.

Method: Proposed HBG framework integrates TB and DB in a principled adaptive manner, with specialized inference strategy for depot-centric scenarios like CVRP while maintaining broad applicability to problems without depots like TSP.

Result: HBG integrated into AGFN and GFACS solvers shows consistent and significant improvements across both CVRP and TSP, demonstrating enhanced solution quality and generalization.

Conclusion: The hybrid approach combining TB and DB in HBG framework effectively addresses the limitations of individual methods and provides superior performance in vehicle routing problems.

Abstract: Existing GFlowNet-based methods for vehicle routing problems (VRPs) typically employ Trajectory Balance (TB) to achieve global optimization but often neglect important aspects of local optimization. While Detailed Balance (DB) addresses local optimization more effectively, it alone falls short in solving VRPs, which inherently require holistic trajectory optimization. To address these limitations, we introduce the Hybrid-Balance GFlowNet (HBG) framework, which uniquely integrates TB and DB in a principled and adaptive manner by aligning their intrinsically complementary strengths. Additionally, we propose a specialized inference strategy for depot-centric scenarios like the Capacitated Vehicle Routing Problem (CVRP), leveraging the depot node’s greater flexibility in selecting successors. Despite this specialization, HBG maintains broad applicability, extending effectively to problems without explicit depots, such as the Traveling Salesman Problem (TSP). We evaluate HBG by integrating it into two established GFlowNet-based solvers, i.e., AGFN and GFACS, and demonstrate consistent and significant improvements across both CVRP and TSP, underscoring the enhanced solution quality and generalization afforded by our approach.

[584] Natural Language Edge Labelling: Decoupling Intent from Execution in Structured LM Reasoning

Abhinav Madahar

Main category: cs.AI

TL;DR: NLEL (Natural Language Edge Labelling) is a labeller-tuner overlay that attaches natural-language directives to search edges in structured LM reasoning, translating them into schema-bounded control vectors for decoding, search, generation, retrieval, and verification.

DetailsMotivation: Existing controllers for structured LM reasoning (Chain-of-Thought, self-consistency, Tree-of-Thoughts) entangle what to try next with how to execute it, leading to brittle, compute-inefficient, and hard-to-audit behavior.

Method: A labeller Λ emits natural-language labels from parent state and compact context; a tuner Ψ maps (P, L, C) → Π with strict schema validation and trust-region projection around safe defaults. Uses ToT-style selection with score S=μ+βσ and depth-annealed β.

Result: NLEL strictly generalizes CoT/ToT, proves anytime-monotonicity for top-k selection, and bounds selector shortfall by control-vector distortion. Preregistered evaluation anticipates accuracy gains at comparable token budgets and improved success@compute under constraints.

Conclusion: NLEL offers an interpretable, model-agnostic interface that separates intent from execution for controllable, auditable LM inference.

Abstract: Controllers for structured LM reasoning (e.g., Chain-of-Thought, self-consistency, and Tree-of-Thoughts) often entangle what to try next with how to execute it, exposing only coarse global knobs and yielding brittle, compute-inefficient, and hard-to-audit behavior. We introduce Natural Language Edge Labelling (NLEL), a labeller-tuner overlay that attaches a free-form natural-language directive to each search edge and translates it into a schema-bounded control vector for decoding, search (branch quotas, exploration $\beta$), generation bundle size, retrieval mixtures, and verification passes. A labeller $\Lambda$ emits labels from the parent state and a compact context; a tuner $\Psi$ maps $(P, L, C)\to \Pi$, with strict schema validation and trust-region projection around safe defaults. Downstream selection remains ToT-style with score $S=\mu+\beta\sigma$ and depth-annealed $\beta$. We show NLEL strictly generalizes CoT/ToT, prove an anytime-monotonicity property for top-$k$ selection under label-conditioned bundles, and bound selector shortfall by control-vector distortion, providing decision-relevant justification for guards like trust regions and verification passes. We instantiate $\Psi$ as a prompt-only JSON Parameter Emitter and preregister an evaluation on GSM8K, MATH (subset), StrategyQA, and ARC-Challenge with compute-aware reporting (success@compute, tokens-per-success) and ablations over $\Lambda$, $\Psi$, trust-region radius, and control quantization; preregistered forecasts anticipate accuracy gains at comparable token budgets and improved success@compute under constraints. NLEL offers an interpretable, model-agnostic interface that separates intent from execution for controllable, auditable LM inference.

[585] Human Behavior Atlas: Benchmarking Unified Psychological and Social Behavior Understanding

Keane Ong, Wei Dai, Carol Li, Dewei Feng, Hengzhi Li, Jingyao Wu, Jiaee Cheong, Rui Mao, Gianmarco Mengaldo, Erik Cambria, Paul Pu Liang

Main category: cs.AI

TL;DR: The paper introduces Human Behavior Atlas, a unified benchmark for psychological and social behavior understanding, and shows that training models on this dataset outperforms existing multimodal LLMs across diverse behavioral tasks.

DetailsMotivation: Existing work on psychological and social behavior understanding uses specialized datasets and single-task systems, missing opportunities for scalability, cross-task transfer, and broader generalization.

Method: Curated Human Behavior Atlas - a unified benchmark with over 100,000 multimodal samples covering affective states, cognitive states, pathologies, and social processes. Trained three models: OmniSapiens-7B SFT, OmniSapiens-7B BAM, and OmniSapiens-7B RL.

Result: Models trained on Human Behavior Atlas consistently outperform existing multimodal LLMs across diverse behavioral tasks. Pretraining also improves transfer to novel behavioral datasets with meaningful performance gains.

Conclusion: Unified behavioral benchmarks like Human Behavior Atlas can reduce redundancy, enable efficient cross-task scaling, and enhance generalization of behavioral features across domains.

Abstract: Using intelligent systems to perceive psychological and social behaviors, that is, the underlying affective, cognitive, and pathological states that are manifested through observable behaviors and social interactions, remains a challenge due to their complex, multifaceted, and personalized nature. Existing work tackling these dimensions through specialized datasets and single-task systems often miss opportunities for scalability, cross-task transfer, and broader generalization. To address this gap, we curate Human Behavior Atlas, a unified benchmark of diverse behavioral tasks designed to support the development of unified models for understanding psychological and social behaviors. Human Behavior Atlas comprises over 100,000 samples spanning text, audio, and visual modalities, covering tasks on affective states, cognitive states, pathologies, and social processes. Our unification efforts can reduce redundancy and cost, enable training to scale efficiently across tasks, and enhance generalization of behavioral features across domains. On Human Behavior Atlas, we train three models: OmniSapiens-7B SFT, OmniSapiens-7B BAM, and OmniSapiens-7B RL. We show that training on Human Behavior Atlas enables models to consistently outperform existing multimodal LLMs across diverse behavioral tasks. Pretraining on Human Behavior Atlas also improves transfer to novel behavioral datasets; with the targeted use of behavioral descriptors yielding meaningful performance gains.

[586] MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning

Guoxin Chen, Zile Qiao, Wenqing Wang, Donglei Yu, Xuanzhong Chen, Hao Sun, Minpeng Liao, Kai Fan, Yong Jiang, Penguin Xie, Wayne Xin Zhao, Ruihua Song, Fei Huang

Main category: cs.AI

TL;DR: MARS introduces a multi-agent system that integrates System 1 (fast, intuitive) and System 2 (deliberate) reasoning in LLMs to address overanalysis and adapt to dynamic environments, achieving significant performance gains on complex reasoning benchmarks.

DetailsMotivation: Address LRMs' tendency for overanalysis in simple tasks and their inability to adapt to rapidly changing environments due to static pretraining data, by bridging intuitive and deliberate cognitive processes.

Method: Multi-Agent System for Deep ReSearch (MARS) that integrates external tools (Google Search, Google Scholar, Python Interpreter) and uses multi-agent reinforcement learning with Group Relative Policy Optimization to optimize both reasoning systems with tool interactions and bin-packing optimization.

Result: Achieves 3.86% improvement on Humanity’s Last Exam benchmark and average 8.9% gain across 7 knowledge-intensive tasks, demonstrating effective dual-system reasoning in dynamic environments.

Conclusion: MARS successfully bridges System 1 and System 2 reasoning in LLMs, enabling more efficient and adaptive complex reasoning through strategic tool integration and multi-agent optimization.

Abstract: Large Reasoning Models (LRMs) often exhibit a tendency for overanalysis in simple tasks, where the models excessively utilize System 2-type, deliberate reasoning, leading to inefficient token generation. Furthermore, these models face challenges in adapting their reasoning capabilities to rapidly changing environments due to the static nature of their pretraining data. To address these issues, advancing Large Language Models (LLMs) for complex reasoning tasks requires innovative approaches that bridge intuitive and deliberate cognitive processes, akin to human cognition’s dual-system dynamic. This paper introduces a Multi-Agent System for Deep ReSearch (MARS) enabling seamless integration of System 1’s fast, intuitive thinking with System 2’s deliberate reasoning within LLMs. MARS strategically integrates multiple external tools, such as Google Search, Google Scholar, and Python Interpreter, to access up-to-date information and execute complex computations, while creating a specialized division of labor where System 1 efficiently processes and summarizes high-volume external information, providing distilled insights that expand System 2’s reasoning context without overwhelming its capacity. Furthermore, we propose a multi-agent reinforcement learning framework extending Group Relative Policy Optimization to simultaneously optimize both systems with multi-turn tool interactions, bin-packing optimization, and sample balancing strategies that enhance collaborative efficiency. Extensive experiments demonstrate MARS achieves substantial improvements of 3.86% on the challenging Humanity’s Last Exam (HLE) benchmark and an average gain of 8.9% across 7 knowledge-intensive tasks, validating the effectiveness of our dual-system paradigm for complex reasoning in dynamic information environments.

[587] Safe and Compliant Cross-Market Trade Execution via Constrained RL and Zero-Knowledge Audits

Ailiya Borjigin, Cong He

Main category: cs.AI

TL;DR: A cross-market algorithmic trading system using reinforcement learning with built-in compliance enforcement and zero-knowledge audit proofs, achieving better execution quality than standard baselines without constraint violations.

DetailsMotivation: To develop an algorithmic trading system that balances execution quality with rigorous compliance enforcement, addressing the need for both performance and regulatory compliance in financial markets.

Method: Uses a three-component architecture: high-level planner, RL execution agent trained with proximal policy optimization, and independent compliance agent with runtime action-shield. Formulates trade execution as constrained Markov decision process with hard constraints on participation limits, price bands, and self-trading avoidance.

Result: The learned policy reduces implementation shortfall and variance while exhibiting no observed constraint violations across stress scenarios including elevated latency, partial fills, compliance module toggling, and varying constraint limits. Results are statistically significant at 95% confidence level.

Conclusion: The work successfully combines optimal execution, safe reinforcement learning, regulatory technology, and verifiable AI, demonstrating a path to real-world deployment while addressing ethical considerations and computational limitations.

Abstract: We present a cross-market algorithmic trading system that balances execution quality with rigorous compliance enforcement. The architecture comprises a high-level planner, a reinforcement learning execution agent, and an independent compliance agent. We formulate trade execution as a constrained Markov decision process with hard constraints on participation limits, price bands, and self-trading avoidance. The execution agent is trained with proximal policy optimization, while a runtime action-shield projects any unsafe action into a feasible set. To support auditability without exposing proprietary signals, we add a zero-knowledge compliance audit layer that produces cryptographic proofs that all actions satisfied the constraints. We evaluate in a multi-venue, ABIDES-based simulator and compare against standard baselines (e.g., TWAP, VWAP). The learned policy reduces implementation shortfall and variance while exhibiting no observed constraint violations across stress scenarios including elevated latency, partial fills, compliance module toggling, and varying constraint limits. We report effects at the 95% confidence level using paired t-tests and examine tail risk via CVaR. We situate the work at the intersection of optimal execution, safe reinforcement learning, regulatory technology, and verifiable AI, and discuss ethical considerations, limitations (e.g., modeling assumptions and computational overhead), and paths to real-world deployment.

[588] Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

Kun Xiang, Terry Jingchen Zhang, Yinya Huang, Jixi He, Zirong Liu, Yueling Tang, Ruizhe Zhou, Lijing Luo, Youpeng Wen, Xiuwei Chen, Bingqian Lin, Jianhua Han, Hang Xu, Hanhui Li, Bin Dong, Xiaodan Liang

Main category: cs.AI

TL;DR: This paper provides a comprehensive overview of Physical AI, bridging the gap between theoretical physics reasoning and applied physical understanding in AI systems.

DetailsMotivation: The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, but physical perception and symbolic physics reasoning have developed separately without a unified framework.

Method: The work systematically examines how physics-grounded methods enhance AI’s real-world comprehension across structured symbolic reasoning, embodied systems, and generative models through rigorous analysis of recent advances.

Result: The paper establishes clear distinctions between theoretical physics reasoning and applied physical understanding, advocating for intelligent systems that ground learning in both physical principles and embodied reasoning processes.

Conclusion: The synthesis envisions next-generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems that transcend pattern recognition toward genuine understanding of physical laws.

Abstract: The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, yet physical perception and symbolic physics reasoning have developed along separate trajectories without a unified bridging framework. This work provides a comprehensive overview of physical AI, establishing clear distinctions between theoretical physics reasoning and applied physical understanding while systematically examining how physics-grounded methods enhance AI’s real-world comprehension across structured symbolic reasoning, embodied systems, and generative models. Through rigorous analysis of recent advances, we advocate for intelligent systems that ground learning in both physical principles and embodied reasoning processes, transcending pattern recognition toward genuine understanding of physical laws. Our synthesis envisions next-generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems. We maintain a continuously updated resource at https://github.com/AI4Phys/Awesome-AI-for-Physics.

[589] LLM-Hanabi: Evaluating Multi-Agent Gameplays with Theory-of-Mind and Rationale Inference in Imperfect Information Collaboration Game

Fangzhou Liang, Tianshi Zheng, Chunkit Chan, Yauwai Yim, Yangqiu Song

Main category: cs.AI

TL;DR: LLM-Hanabi benchmark evaluates LLMs’ Theory-of-Mind capabilities in collaborative settings using the Hanabi game, finding first-order ToM (interpreting intent) correlates more with performance than second-order ToM.

DetailsMotivation: To assess LLMs' ability to infer rationale behind others' actions in dynamic collaborative settings, which is crucial for effective multi-agent collaboration but remains under-explored despite LLMs' logical inference strengths.

Method: Developed LLM-Hanabi benchmark using cooperative game Hanabi with automated evaluation system measuring both game performance and Theory-of-Mind proficiency across various LLMs.

Result: Found significant positive correlation between ToM and in-game success, with first-order ToM (interpreting others’ intent) correlating more strongly with performance than second-order ToM (predicting others’ interpretations).

Conclusion: For effective AI collaboration, accurately interpreting a partner’s rationale (first-order ToM) is more critical than higher-order reasoning, suggesting prioritizing first-order ToM is promising for enhancing future models’ collaborative capabilities.

Abstract: Effective multi-agent collaboration requires agents to infer the rationale behind others’ actions, a capability rooted in Theory-of-Mind (ToM). While recent Large Language Models (LLMs) excel at logical inference, their ability to infer rationale in dynamic, collaborative settings remains under-explored. This study introduces LLM-Hanabi, a novel benchmark that uses the cooperative game Hanabi to evaluate the rationale inference and ToM of LLMs. Our framework features an automated evaluation system that measures both game performance and ToM proficiency. Across a range of models, we find a significant positive correlation between ToM and in-game success. Notably, first-order ToM (interpreting others’ intent) correlates more strongly with performance than second-order ToM (predicting others’ interpretations). These findings highlight that for effective AI collaboration, the ability to accurately interpret a partner’s rationale is more critical than higher-order reasoning. We conclude that prioritizing first-order ToM is a promising direction for enhancing the collaborative capabilities of future models.

[590] Think Then Embed: Generative Context Improves Multimodal Embedding

Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Qi Guo, Ser-Nam Lim, Aashu Singh, Xiangjun Fan

Main category: cs.AI

TL;DR: Proposes Think-Then-Embed (TTE) framework for Universal Multimodal Embeddings that uses MLLMs for reasoning before embedding, achieving SOTA results on MMEB-V2 benchmark.

DetailsMotivation: Current MLLMs are treated only as encoders, ignoring their generative capacity, which becomes ineffective for complex instructions requiring compositional reasoning.

Method: TTE framework with a reasoner MLLM that generates reasoning traces followed by an embedder that produces representations conditioned on both original query and intermediate reasoning.

Result: Achieved state-of-the-art performance on MMEB-V2 benchmark, surpassing proprietary models. Also achieved 7% absolute gain over recent models with smaller finetuned MLLM reasoner.

Conclusion: Explicit reasoning step enables more nuanced understanding of complex multimodal instructions, and the framework can be integrated into unified models for efficiency without performance loss.

Abstract: There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.

[591] Look-ahead Reasoning with a Learned Model in Imperfect Information Games

Ondřej Kubíček, Viliam Lisý

Main category: cs.AI

TL;DR: LAMIR is an algorithm that learns abstracted models of imperfect information games from agent-environment interaction, enabling tractable look-ahead reasoning during test time by limiting subgame sizes to manageable scales.

DetailsMotivation: Test-time reasoning enhances AI agent performance but requires explicit environment models, which are often unavailable or too complex in real-world imperfect information games. Existing methods like MuZero work well for perfect information games but face scalability challenges in imperfect information settings.

Method: LAMIR learns an abstracted model directly from agent-environment interaction. The learned abstraction limits subgame sizes to manageable scales, enabling theoretically principled look-ahead reasoning during test time.

Result: With sufficient capacity, LAMIR learns the exact underlying game structure. With limited capacity, it still learns valuable abstractions that improve game playing performance of pre-trained agents, even in large games where previous methods couldn’t scale.

Conclusion: LAMIR enables tractable look-ahead reasoning in imperfect information games by learning abstracted models, addressing scalability limitations of previous approaches and improving agent performance across different capacity settings.

Abstract: Test-time reasoning significantly enhances pre-trained AI agents’ performance. However, it requires an explicit environment model, often unavailable or overly complex in real-world scenarios. While MuZero enables effective model learning for search in perfect information games, extending this paradigm to imperfect information games presents substantial challenges due to more nuanced look-ahead reasoning techniques and large number of states relevant for individual decisions. This paper introduces an algorithm LAMIR that learns an abstracted model of an imperfect information game directly from the agent-environment interaction. During test time, this trained model is used to perform look-ahead reasoning. The learned abstraction limits the size of each subgame to a manageable size, making theoretically principled look-ahead reasoning tractable even in games where previous methods could not scale. We empirically demonstrate that with sufficient capacity, LAMIR learns the exact underlying game structure, and with limited capacity, it still learns a valuable abstraction, which improves game playing performance of the pre-trained agents even in large games.

[592] Staircase Streaming for Low-Latency Multi-Agent Inference

Junlin Wang, Jue Wang, Zhen, Xu, Ben Athiwaratkun, Bhuwan Dhingra, Ce Zhang, James Zou

Main category: cs.AI

TL;DR: Staircase streaming reduces TTFT by up to 93% in multi-agent LLM inference by generating final responses using partial intermediate outputs instead of waiting for complete outputs.

DetailsMotivation: Multi-agent inference methods like Mixture-of-Agents improve response quality but significantly increase time to first token (TTFT), which is problematic for latency-sensitive applications and user experience.

Method: Propose staircase streaming that begins generating final response as soon as partial outputs are received from intermediate steps, rather than waiting for complete intermediate outputs.

Result: Experimental results show staircase streaming reduces TTFT by up to 93% while maintaining response quality.

Conclusion: Staircase streaming enables low-latency multi-agent inference without compromising response quality, making it suitable for latency-sensitive applications.

Abstract: Recent advances in large language models (LLMs) opened up new directions for leveraging the collective expertise of multiple LLMs. These methods, such as Mixture-of-Agents, typically employ additional inference steps to generate intermediate outputs, which are then used to produce the final response. While multi-agent inference can enhance response quality, it can significantly increase the time to first token (TTFT), posing a challenge for latency-sensitive applications and hurting user experience. To address this issue, we propose staircase streaming for low-latency multi-agent inference. Instead of waiting for the complete intermediate outputs from previous steps, we begin generating the final response as soon as we receive partial outputs from these steps. Experimental results demonstrate that staircase streaming reduces TTFT by up to 93% while maintaining response quality.

[593] CAG: Chunked Augmented Generation for Google Chrome’s Built-in Gemini Nano

Vivek Vellaiyappan Surulimuthu, Aditya Karnam Gururaj Rao

Main category: cs.AI

TL;DR: Chunked Augmented Generation (CAG) is an architecture that overcomes context window limitations in Chrome’s Gemini Nano model through intelligent input chunking and processing strategies.

DetailsMotivation: To address the restricted context window of Google Chrome's built-in Gemini Nano model, which poses challenges for processing large inputs despite being a significant advancement in browser AI capabilities.

Method: Uses intelligent input chunking and processing strategies to efficiently handle extensive content while maintaining model performance within browser constraints.

Result: Demonstrates particular efficacy in processing large documents and datasets directly within Chrome, making sophisticated AI capabilities accessible through the browser without external API dependencies.

Conclusion: CAG successfully enables efficient handling of large inputs within Chrome’s Gemini Nano model constraints, providing browser-native AI capabilities without external dependencies.

Abstract: We present Chunked Augmented Generation (CAG), an architecture specifically designed to overcome the context window limitations of Google Chrome’s built-in Gemini Nano model. While Chrome’s integration of Gemini Nano represents a significant advancement in bringing AI capabilities directly to the browser, its restricted context window poses challenges for processing large inputs. CAG addresses this limitation through intelligent input chunking and processing strategies, enabling efficient handling of extensive content while maintaining the model’s performance within browser constraints. Our implementation demonstrates particular efficacy in processing large documents and datasets directly within Chrome, making sophisticated AI capabilities accessible through the browser without external API dependencies. Get started now at https://github.com/vivekVells/cag-js.

[594] AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

Main category: cs.AI

TL;DR: AgentBench is a multi-dimensional benchmark with 8 environments to evaluate LLMs as agents, revealing significant performance gaps between top commercial LLMs and open-source models under 70B parameters.

DetailsMotivation: There's an urgent need to quantitatively evaluate LLMs as agents in interactive environments to assess their reasoning and decision-making abilities.

Method: Developed AgentBench with 8 distinct environments to test LLM-as-Agent capabilities across multiple dimensions.

Result: Top commercial LLMs show strong agent abilities, but there’s a significant performance gap with open-source competitors under 70B parameters. Main failure reasons include poor long-term reasoning, decision-making, and instruction following.

Conclusion: Improving instruction following and training on high-quality multi-round alignment data can enhance agent performance, while code training has ambivalent impacts on different agent tasks.

Abstract: The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

[595] Graph Generation Powered with LLMs for Boosting Multivariate Time-Series Representation Learning

Yucheng Wang, Min Wu, Ruibing Jin, Xiaoli Li, Lihua Xie, Zhenghua Chen

Main category: cs.AI

TL;DR: K-Link is a novel framework that uses Large Language Models’ universal knowledge to improve graph generation for Multivariate Time-Series data, addressing biases from small training datasets.

DetailsMotivation: Existing MTS graph generation methods rely solely on data, making them vulnerable to biases from small training datasets, which hampers GNN performance in capturing spatial-temporal dependencies.

Method: Proposes K-Link framework that extracts Knowledge-Link graphs from LLMs capturing universal sensor knowledge, and uses graph alignment to transfer this knowledge to data-generated graphs.

Result: Extensive experiments show K-Link achieves superior performance on various MTS tasks by enhancing graph quality and representation learning.

Conclusion: Leveraging LLM knowledge for MTS graph generation effectively reduces biases and improves dependency modeling, leading to better GNN performance.

Abstract: Sourced from multiple sensors and organized chronologically, Multivariate Time-Series (MTS) data involves crucial spatial-temporal dependencies. To capture these dependencies, Graph Neural Networks (GNNs) have emerged as powerful tools. As explicit graphs are not inherent to MTS data, graph generation becomes a critical first step in adapting GNNs to this domain. However, existing approaches often rely solely on the data itself for MTS graph generation, leaving them vulnerable to biases from small training datasets. This limitation hampers their ability to construct effective graphs, undermining the accurate modeling of underlying dependencies in MTS data and reducing GNN performance in this field. To address this challenge, we propose a novel framework, K-Link, leveraging the extensive universal knowledge encoded in Large Language Models (LLMs) to reduce biases for powered MTS graph generation. To harness the knowledge within LLMs, such as physical principles, we design and extract a \textit{Knowledge-Link graph} that captures universal knowledge of sensors and their linkage. To empower MTS graph generation with the knowledge-link graph, we further introduce a graph alignment module that transfers universal knowledge from the knowledge-link graph to the graph generated from MTS data. This enhances the MTS graph quality, ensuring effective representation learning for MTS data. Extensive experiments demonstrate the efficacy of K-Link for superior performance on various MTS tasks.

[596] CHARME: A chain-based reinforcement learning approach for the minor embedding problem

Hoang M. Ngo, Nguyen H K. Do, Minh N. Vu, Tre’ R. Jeter, Tamer Kahveci, My T. Thai

Main category: cs.AI

TL;DR: CHARME is a reinforcement learning approach for solving the minor embedding problem in quantum annealing, using GNN for policy modeling and achieving better qubit usage than existing methods.

DetailsMotivation: Quantum annealing's effectiveness depends on embedding problem instances into quantum hardware with limited connectivity, but existing methods for this NP-hard minor embedding problem have scalability issues with larger problems.

Method: Proposed CHARME framework with three components: Graph Neural Network for policy modeling, state transition algorithm ensuring solution validity, and order exploration strategy for effective training.

Result: CHARME yields superior solutions in qubit usage compared to Minorminer and ATOM, surpasses OCT-based approach in several cases, and the exploration strategy enhances training efficiency over greedy approaches.

Conclusion: The RL-based CHARME framework provides an effective solution to the minor embedding problem, outperforming existing methods and demonstrating the potential of reinforcement learning for quantum computing optimization problems.

Abstract: Quantum annealing (QA) has great potential to solve combinatorial optimization problems efficiently. However, the effectiveness of QA algorithms is heavily based on the embedding of problem instances, represented as logical graphs, into the quantum processing unit (QPU) whose topology is in the form of a limited connectivity graph, known as the minor embedding problem. Because the minor embedding problem is an NP-hard problem~\mbox{\cite{Goodrich2018}}, existing methods for the minor embedding problem suffer from scalability issues when faced with larger problem sizes. In this paper, we propose a novel approach utilizing Reinforcement Learning (RL) techniques to address the minor embedding problem, named CHARME. CHARME includes three key components: a Graph Neural Network (GNN) architecture for policy modeling, a state transition algorithm that ensures solution validity, and an order exploration strategy for effective training. Through comprehensive experiments on synthetic and real-world instances, we demonstrate the efficiency of our proposed order exploration strategy as well as our proposed RL framework, CHARME. In particular, CHARME yields superior solutions in terms of qubit usage compared to fast embedding methods such as Minorminer and ATOM. Moreover, our method surpasses the OCT-based approach, known for its slower runtime but high-quality solutions, in several cases. In addition, our proposed exploration enhances the efficiency of the training of the CHARME framework by providing better solutions compared to the greedy strategy.

[597] Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment

Chao Wen, Jacqueline Staub, Adish Singla

Main category: cs.AI

TL;DR: A novel program synthesis benchmark based on XLogoOnline visual programming tasks that require combined skills like spatial planning, programming, and reasoning, where current SOTA models perform poorly but fine-tuning with synthetic data and curriculum learning dramatically improves performance.

DetailsMotivation: To evaluate how well large language and multimodal models perform on tasks requiring combination of multiple skills (programming, math, visual reasoning) rather than individual benchmarks, as current models excel at specific skills but their performance on integrated tasks is unknown.

Method: Created a program synthesis benchmark from XLogoOnline visual programming environment tasks; developed fine-tuning pipeline using large-scale synthetic training dataset (80,000+ tasks); implemented emulator-driven feedback for curriculum learning over training data distribution.

Result: GPT-4V and Llama3-70B achieved only 20% and 2.35% success rates respectively; fine-tuned Llama3-8B drastically outperformed both larger models after curriculum-based training.

Conclusion: Current SOTA models struggle with tasks requiring combined skills, but targeted fine-tuning with synthetic data and curriculum learning can significantly boost performance, highlighting the need for specialized training approaches for multi-skill tasks.

Abstract: Large language and multimodal models have shown remarkable success on various benchmarks focused on specific skills such as general-purpose programming, math word problem-solving, and visual question answering. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on the real-world tasks in the XLogoOnline visual programming environment. Each task requires a combination of different skills such as spatial planning, basic programming, and logical reasoning. Our evaluation shows that current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving only 20% and 2.35% success rates, respectively. Next, we develop a fine-tuning pipeline to boost the performance of models by leveraging a large-scale synthetic training dataset with over 80,000 tasks. Moreover, we showcase how emulator-driven feedback can be used to design a curriculum over training data distribution, through which a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models. Finally, we provide an in-depth failure analysis to understand the limitations of different models. We will publicly release the benchmark for future research on program synthesis in visual programming.

[598] Optimizing Agricultural Order Fulfillment Systems: A Hybrid Tree Search Approach

Pranay Thangeda, Hoda Helmi, Melkior Ornik

Main category: cs.AI

TL;DR: Proposes an adaptive hybrid tree search algorithm combining Monte Carlo tree search with domain knowledge to optimize seed order fulfillment in agricultural warehouses, handling unpredictable stock arrivals and strict deadlines.

DetailsMotivation: Address the challenge of efficient order fulfillment in agricultural seed supply chains due to seasonal nature, unpredictable stock arrivals, and strict order deadlines in centralized warehouses.

Method: Model wave scheduling as Markov decision process and develop adaptive hybrid tree search algorithm that combines Monte Carlo tree search with domain-specific knowledge to reduce action space complexity.

Result: Extensive simulations with realistic parameters show the proposed approach significantly outperforms existing industry standard methods in seed order fulfillment.

Conclusion: The hybrid tree search approach effectively handles large state and action spaces in dynamic seed distribution environments, enabling forecast-informed scheduling that balances immediate requirements with long-term efficiency.

Abstract: Efficient order fulfillment is vital in the agricultural industry, particularly due to the seasonal nature of seed supply chains. This paper addresses the challenge of optimizing seed orders fulfillment in a centralized warehouse where orders are processed in waves, taking into account the unpredictable arrival of seed stocks and strict order deadlines. We model the wave scheduling problem as a Markov decision process and propose an adaptive hybrid tree search algorithm that combines Monte Carlo tree search with domain-specific knowledge to efficiently navigate the complex, dynamic environment of seed distribution. By leveraging historical data and stochastic modeling, our method enables forecast-informed scheduling decisions that balance immediate requirements with long-term operational efficiency. The key idea is that we can augment Monte Carlo tree search algorithm with problem-specific side information that dynamically reduces the number of candidate actions at each decision step to handle the large state and action spaces that render traditional solution methods computationally intractable. Extensive simulations with realistic parameters, including a diverse range of products, a high volume of orders, and authentic seasonal durations, demonstrate that the proposed approach significantly outperforms existing industry standard methods.

[599] Efficiently Learning Probabilistic Logical Models by Cheaply Ranking Mined Rules

Jonathan Feldstein, Dominic Phillips, Efthymia Tsamoura

Main category: cs.AI

TL;DR: SPECTRUM is a scalable framework that learns logical theories from relational data using efficient algorithms for mining subgraphs and ranking rules based on a linear-time utility measure, achieving high accuracy with significantly reduced computational costs.

DetailsMotivation: Probabilistic logical models are important for neurosymbolic AI and explainable tasks, but current methods are either handcrafted (costly and error-prone) or computationally expensive, limiting real-world applicability.

Method: Introduces precision and recall for logical rules, defines rule utility as their composition, and develops SPECTRUM with linear-time algorithms for mining recurrent subgraphs and efficiently ranking rules using the utility measure.

Result: SPECTRUM scales to larger datasets, learning more accurate logical theories on CPUs in <1% the runtime of state-of-the-art neural network approaches on GPUs across various tasks.

Conclusion: The framework provides theoretical guarantees on utility and demonstrates practical scalability and efficiency for learning logical theories from relational data.

Abstract: Probabilistic logical models are a core component of neurosymbolic AI and are important in their own right for tasks that require high explainability. Unlike neural networks, logical theories that underlie the model are often handcrafted using domain expertise, making their development costly and prone to errors. While there are algorithms that learn logical theories from data, they are generally prohibitively expensive, limiting their applicability in real-world settings. Here, we introduce precision and recall for logical rules and define their composition as rule utility - a cost-effective measure of the predictive power of logical theories. We also introduce SPECTRUM, a scalable framework for learning logical theories from relational data. Its scalability derives from a linear-time algorithm for mining recurrent subgraphs in the data graph along with a second algorithm that, using a utility measure that can be computed in linear time, efficiently ranks rules derived from these subgraphs. Finally, we prove theoretical guarantees on the utility of the learnt logical theory. As a result, we demonstrate across various tasks that SPECTRUM scales to larger datasets, often learning more accurate logical theories on CPUs in < 1% the runtime of SOTA neural network approaches on GPUs.

[600] MAD-Sherlock: Multi-Agent Debate for Visual Misinformation Detection

Kumud Lakara, Georgia Channing, Christian Rupprecht, Juil Sock, Philip Torr, John Collomosse, Christian Schroeder de Witt

Main category: cs.AI

TL;DR: MAD-Sherlock is a multi-agent debate system that detects out-of-context misinformation by having multimodal agents collaboratively assess contextual consistency and retrieve external information, achieving state-of-the-art performance without domain-specific finetuning.

DetailsMotivation: To address the challenge of detecting misinformation created by pairing images with misleading text, overcoming limitations of existing AI systems that require domain-specific finetuning and lack explainability.

Method: Frames detection as a multi-agent debate where multimodal agents collaborate to assess contextual consistency and retrieve external information for cross-context reasoning. The framework is domain- and time-agnostic, requiring no finetuning.

Result: Outperforms prior methods by 2% on NewsCLIPpings, 3% on VERITE, and 5% on MMFakeBench. Ablation and user studies show debate and explanations significantly improve detection performance and trust for both experts and non-experts.

Conclusion: MAD-Sherlock positions as a robust tool for autonomous citizen intelligence by providing state-of-the-art accuracy with in-depth explanations, improving both detection performance and user trust.

Abstract: One of the most challenging forms of misinformation involves pairing images with misleading text to create false narratives. Existing AI-driven detection systems often require domain-specific finetuning, limiting generalizability, and offer little insight into their decisions, hindering trust and adoption. We introduce MAD-Sherlock, a multi-agent debate system for out-of-context misinformation detection. MAD-Sherlock frames detection as a multi-agent debate, reflecting the diverse and conflicting discourse found online. Multimodal agents collaborate to assess contextual consistency and retrieve external information to support cross-context reasoning. Our framework is domain- and time-agnostic, requiring no finetuning, yet achieves state-of-the-art accuracy with in-depth explanations. Evaluated on NewsCLIPpings, VERITE, and MMFakeBench, it outperforms prior methods by 2%, 3%, and 5%, respectively. Ablation and user studies show that the debate and resultant explanations significantly improve detection performance and improve trust for both experts and non-experts, positioning MAD-Sherlock as a robust tool for autonomous citizen intelligence.

[601] Can Large Language Models generalize analogy solving like children can?

Claire E. Stevenson, Alexandra Pafford, Han L. J. van der Maas, Melanie Mitchell

Main category: cs.AI

TL;DR: LLMs struggle with robust analogical transfer to new domains unlike humans who easily generalize analogy solving skills.

DetailsMotivation: To investigate whether LLMs can generalize analogy solving to new domains like humans do, comparing human performance (children and adults) with LLMs across different domains.

Method: Tested children, adults, and LLMs on letter-string analogies in Latin alphabet, near transfer (Greek alphabet), and far transfer (symbol lists) domains.

Result: Children and adults easily generalized analogy solving to unfamiliar domains, while LLMs failed to transfer their analogy solving capabilities to new domains.

Conclusion: LLMs still lack robust human-like analogical transfer capabilities, showing a key difference between human and AI performance in domain generalization.

Abstract: In people, the ability to solve analogies such as “body : feet :: table : ?” emerges in childhood, and appears to transfer easily to other domains, such as the visual domain “( : ) :: < : ?”. Recent research shows that large language models (LLMs) can solve various forms of analogies. However, can LLMs generalize analogy solving to new domains like people can? To investigate this, we had children, adults, and LLMs solve a series of letter-string analogies (e.g., a b : a c :: j k : ?) in the Latin alphabet, in a near transfer domain (Greek alphabet), and a far transfer domain (list of symbols). Children and adults easily generalized their knowledge to unfamiliar domains, whereas LLMs did not. This key difference between human and AI performance is evidence that these LLMs still struggle with robust human-like analogical transfer.

[602] Data clustering: a fundamental method in data science and management

Tai Dinh, Wong Hauchi, Daniil Lisik, Michal Koren, Dat Tran, Philip S. Yu, Joaquín Torres-Sospedra

Main category: cs.AI

TL;DR: This paper provides a comprehensive overview of data clustering in data science, covering traditional and advanced methodologies, tools, applications, challenges, and future research directions.

DetailsMotivation: To explore the critical role of data clustering in data science and emphasize its transformative potential in enabling data-driven decision-making.

Method: Analyzes traditional techniques (partitional and hierarchical clustering) alongside advanced approaches (data stream, density-based, graph-based, and model-based clustering) for handling complex structured datasets.

Result: The paper highlights key clustering principles, outlines widely used tools and frameworks, introduces clustering workflow, discusses implementation challenges, and examines various applications.

Conclusion: Clustering has transformative potential in data science, with future research directions focusing on driving innovation and enabling data-driven decision-making.

Abstract: This paper explores the critical role of data clustering in data science, emphasizing its methodologies, tools, and diverse applications. Traditional techniques, such as partitional and hierarchical clustering, are analyzed alongside advanced approaches such as data stream, density-based, graph-based, and model-based clustering for handling complex structured datasets. The paper highlights key principles underpinning clustering, outlines widely used tools and frameworks, introduces the workflow of clustering in data science, discusses challenges in practical implementation, and examines various applications of clustering. By focusing on these foundations and applications, the discussion underscores clustering’s transformative potential. The paper concludes with insights into future research directions, emphasizing clustering’s role in driving innovation and enabling data-driven decision-making.

[603] Neural Deconstruction Search for Vehicle Routing Problems

André Hottung, Paula Wong-Chung, Kevin Tierney

Main category: cs.AI

TL;DR: The paper introduces an iterative search framework that deconstructs and rebuilds solutions using neural policies combined with greedy insertion, challenging traditional sequential construction methods for vehicle routing problems.

DetailsMotivation: To challenge the conventional paradigm of sequential solution construction in vehicle routing problems and develop an alternative approach that can match or surpass state-of-the-art operations research methods.

Method: An iterative search framework where solutions are deconstructed by a neural policy, then rebuilt through collaboration between the neural policy and a simple greedy insertion algorithm.

Result: The approach matches or surpasses the performance of state-of-the-art operations research methods across three challenging vehicle routing problems of various problem sizes.

Conclusion: The proposed iterative deconstruction and reconstruction framework provides a viable alternative to sequential solution construction, achieving competitive performance with traditional operations research techniques.

Abstract: Autoregressive construction approaches generate solutions to vehicle routing problems in a step-by-step fashion, leading to high-quality solutions that are nearing the performance achieved by handcrafted operations research techniques. In this work, we challenge the conventional paradigm of sequential solution construction and introduce an iterative search framework where solutions are instead deconstructed by a neural policy. Throughout the search, the neural policy collaborates with a simple greedy insertion algorithm to rebuild the deconstructed solutions. Our approach matches or surpasses the performance of state-of-the-art operations research methods across three challenging vehicle routing problems of various problem sizes.

[604] Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption

Alireza Nik, Michael A. Riegler, Pål Halvorsen

Main category: cs.AI

TL;DR: This paper analyzes how different text generation decoding strategies in LLMs affect GPU energy consumption, revealing significant energy impacts even when output quality remains similar.

DetailsMotivation: To investigate the understudied relationship between decoding techniques and energy efficiency in LLMs, focusing on the trade-off between generation quality and GPU energy usage.

Method: Benchmarked multiple decoding strategies across diverse tasks (Translation, Math Problem Solving, Coding, Open-ended generation) with various configurations to measure their impact on both text quality and energy consumption.

Result: Decoding strategy choice significantly impacts GPU energy usage even with minimal effect on output quality. Different strategies involve quality-energy trade-offs, and no single method performs best across all metrics.

Conclusion: This pioneering study provides insights for building energy-efficient LLM applications without compromising text quality by carefully selecting decoding strategies based on specific use cases and requirements.

Abstract: Decoding strategies significantly influence the quality and diversity of the generated text in Large Language Models (LLMs), yet their impact on computational resources, particularly GPU energy consumption, is insufficiently studied. This paper investigates the relationship between text generation decoding techniques and energy efficiency, focusing on the trade-off between generation quality and GPU energy usage across diverse tasks and decoding configurations. By benchmarking multiple strategies across various tasks, including Translation, Math Problem Solving, Coding, and Open-ended text generation, we reveal how selecting appropriate decoding techniques with their tuned hyperparameters affects text quality and has measurable implications for energy consumption. Our findings show that the choice of decoding strategy can greatly impact GPU energy usage, even when it has a minimal effect on output quality. Different strategies also involve trade-offs between quality and energy efficiency, and no single decoding method is best in all cases across every metric. To the best of our knowledge, this is one of the first studies to examine decoding strategies in LLMs from the perspective of energy consumption, providing useful insights for building energy-efficient applications without compromising text generation quality.

[605] Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino

Main category: cs.AI

TL;DR: Language models exhibit a systematic “disjunctive bias” in causal reasoning, struggling with conjunctive relationships despite evidence, similar to human adults. A test-time sampling method helps reduce this bias.

DetailsMotivation: To examine whether language models possess the cognitive skill of efficiently exploring and understanding causal structures, particularly whether they exhibit systematic biases in causal reasoning.

Method: Using the Blicket Test paradigm from developmental psychology to test LMs’ causal inference abilities across different model families, sizes, and prompting strategies, plus proposing a test-time sampling method that explicitly samples and eliminates hypotheses.

Result: LMs reliably infer disjunctive causal relationships but systematically struggle with conjunctive ones, showing a persistent “disjunctive bias” that worsens with task complexity. This bias mirrors human adult reasoning patterns.

Conclusion: LMs inherit deep-seated reasoning heuristics from training data, exhibiting adult-like inference profiles. The proposed test-time sampling method significantly reduces the disjunctive bias, moving LMs toward more scientifically rigorous causal reasoning.

Abstract: Language model (LM) agents are increasingly used as autonomous decision-makers which need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world – key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs’ ability to explore and infer causal relationships, using the well-established Blicket Test paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This “disjunctive bias” persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not child-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.

[606] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang

Main category: cs.AI

TL;DR: The paper introduces TIME, a multi-level benchmark for temporal reasoning in real-world scenarios, addressing challenges like intensive temporal information, fast-changing event dynamics, and complex temporal dependencies.

DetailsMotivation: Existing works neglect real-world challenges for temporal reasoning: intensive temporal information, fast-changing event dynamics, and complex temporal dependencies in social interactions.

Method: Proposed TIME benchmark with 38,522 QA pairs across 3 levels and 11 sub-tasks, covering three sub-datasets (TIME-Wiki, TIME-News, TIME-Dial) reflecting different real-world challenges.

Result: Extensive experiments conducted on reasoning and non-reasoning models, with in-depth analysis of temporal reasoning performance across diverse scenarios and tasks, plus study of test-time scaling impact.

Conclusion: The benchmark and TIME-Lite subset are released to foster future research and standardized evaluation in temporal reasoning, with code and datasets publicly available.

Abstract: Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. And we conducted an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarized the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME , the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME , and the project page link is https://sylvain-wei.github.io/TIME/ .

[607] RECAST: Expanding the Boundaries of LLMs’ Complex Instruction Following with Multi-Constraint Data

Zhengkang Guo, Wenhao Liu, Mingchen Xie, Jingwen Xu, Zisu Huang, Muzhao Tian, Jianhan Xu, Yuanzhe Shen, Qi Qian, Muling Wu, Xiaohua Wang, Changze Lv, He-Da Wang, Hu Yao, Xiaoqing Zheng, Xuanjing Huang

Main category: cs.AI

TL;DR: RECAST is a framework for creating datasets with many constraints (more than 10) to improve LLMs’ ability to follow complex instructions, resulting in RECAST-30K dataset that significantly boosts model performance.

DetailsMotivation: LLMs struggle with complex instructions containing many constraints (especially >10), limiting their real-world applicability, and existing datasets don't adequately test this capability.

Method: Proposed RECAST framework extracts constraints from real-world prompt-response pairs to synthesize datasets with many constraints, and uses rule-based and LLM-based validators for automatic verification.

Result: Models finetuned on RECAST-30K substantially improve in following complex instructions while maintaining general capabilities, and the verifiability enables reward functions for RL that further boost performance.

Conclusion: RECAST effectively addresses LLMs’ limitations with complex multi-constraint instructions and provides a scalable framework for improving instruction-following capabilities in challenging real-world scenarios.

Abstract: Large language models (LLMs) are increasingly expected to tackle complex tasks, driven by their expanding applications and users’ growing proficiency in crafting sophisticated prompts. However, as the number of explicitly stated requirements increases (particularly more than 10 constraints), LLMs often struggle to accurately follow such complex instructions, which limits their applicability in complex real-world scenarios. To the best of our knowledge, existing datasets do not exceed 10 constraints per instance. To address this challenge, we propose RECAST, an efficient and scalable framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks, aiming to challenge and extend the boundaries of models’ ability to follow complex instructions. These constraints are extracted from real-world prompt-response pairs to ensure practical relevance. Using this framework, we construct RECAST-30K, a large-scale, high-quality dataset comprising 30k instances spanning 19 constraint types. Experimental results demonstrate that models finetuned on RECAST-30K substantially improve in following complex instructions while maintaining their general capabilities without degradation. Moreover, RECAST enables automatic verification of constraint satisfaction via rule-based validators for quantitative constraints and LLM-based validators for qualitative ones; the verifiability provided by RECAST enables the design of reward functions for reinforcement learning, which further boosts model performance on complex and challenging tasks.

[608] PatentMind: A Multi-Aspect Reasoning Graph for Patent Similarity Evaluation

Yongmin Yoo, Qiongkai Xu, Longbing Cao

Main category: cs.AI

TL;DR: PatentMind is a novel framework for patent similarity assessment using Multi-Aspect Reasoning Graph (MARG) that decomposes patents into technical features, application domains, and claim scopes, then calculates dynamically weighted similarity scores.

DetailsMotivation: Existing patent similarity methods overlook the intricate structure of patent documents that integrate technical specifications, legal boundaries, and application contexts, failing to capture the multi-dimensional nature of patents.

Method: PatentMind decomposes patents into three dimensions (technical features, application domains, claim scopes), calculates dimension-specific similarity scores over MARG, and dynamically weights them through context-aware reasoning to emulate expert judgment.

Result: PatentMind achieves strong correlation (r=0.938) with expert annotations on PatentSimBench (500 patent pairs), significantly outperforming embedding-based models, patent-specific models, and advanced prompt engineering methods.

Conclusion: The framework provides a structured, semantically grounded foundation for real-world patent decision-making, particularly for infringement risk assessment, demonstrating broader impact on patent analytics and evaluation.

Abstract: Patent similarity evaluation plays a critical role in intellectual property analysis. However, existing methods often overlook the intricate structure of patent documents, which integrate technical specifications, legal boundaries, and application contexts. We introduce PatentMind, a novel framework for patent similarity assessment based on a Multi-Aspect Reasoning Graph (MARG). PatentMind decomposes patents into their three dimensions of technical features, application domains, and claim scopes, then dimension-specific similarity scores are calculated over the MARG. These scores are dynamically weighted through a context-aware reasoning process, which integrates contextual signals to emulate expert-level judgment. To support evaluation, we construct a human-annotated benchmark PatentSimBench, comprising 500 patent pairs. Experimental results demonstrate that the PatentMind-generated scores show a strong correlation ($r=0.938$) with expert annotations, significantly outperforming embedding-based models, patent-specific models, and advanced prompt engineering methods. Beyond computational linguistics, our framework provides a structured and semantically grounded foundation for real-world decision-making, particularly for tasks such as infringement risk assessment, underscoring its broader impact on both patent analytics and evaluation.

[609] Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Guangchen Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher G. Brinton, Robert Sim

Main category: cs.AI

TL;DR: The paper presents a method to improve contextual integrity in autonomous agents by combining explicit reasoning prompts with reinforcement learning, reducing inappropriate information disclosure while maintaining task performance.

DetailsMotivation: As autonomous agents make decisions for users, ensuring contextual integrity (determining appropriate information to share for specific tasks) becomes crucial, requiring agents to reason about their operating context.

Method: First prompted LLMs to reason explicitly about contextual integrity, then developed a reinforcement learning framework to instill reasoning capabilities. Used a synthetic dataset of ~700 examples with diverse contexts and information disclosure norms.

Result: Substantially reduced inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Improvements transferred from synthetic dataset to established CI benchmarks like PrivacyLens with human annotations.

Conclusion: The approach successfully enables agents to reason about contextual integrity, reducing privacy violations while preserving functionality, with transferable improvements to real-world benchmarks.

Abstract: As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) – what is the appropriate information to share while carrying out a certain task – becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created, dataset of only $\sim700$ examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens that has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.

[610] Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework

Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo

Main category: cs.AI

TL;DR: A novel preference learning framework that aligns aggregate opinions proportionally with true population distribution, addressing bias and manipulation in conventional methods.

DetailsMotivation: Conventional preference learning methods prioritize widely held opinions, leading to biased policies and susceptibility to strategic manipulation.

Method: Social choice theory-based approach that infers evaluator population distributions from pairwise comparisons, constructs policies satisfying monotonicity, Pareto efficiency, and new axioms of population-proportional alignment and bounded manipulability, with soft-max relaxation for trade-offs.

Result: The approach effectively addresses bias and manipulation while maintaining foundational social choice axioms, validated through experiments on tabular recommendation tasks and large language model alignment.

Conclusion: The framework provides a principled solution for proportional preference aggregation that is robust to manipulation and scalable across different applications.

Abstract: Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups and susceptible to strategic manipulation. To address this issue, we develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Grounded in social choice theory, our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly-introduced axioms of population-proportional alignment and population-bounded manipulability. Moreover, we propose a soft-max relaxation method that smoothly trade-offs population-proportional alignment with the selection of the Condorcet winner (which beats all other options in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large language model alignment.

[611] Learning What Reinforcement Learning Can’t: Interleaved Online Fine-Tuning for Hardest Questions

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, Wentao Zhang

Main category: cs.AI

TL;DR: ReLIFT is a novel training approach that interleaves reinforcement learning with online fine-tuning to overcome RL’s limitations in acquiring new knowledge, achieving significant improvements on reasoning benchmarks.

DetailsMotivation: Current RL methods for LLM reasoning are insufficient for acquiring new knowledge beyond the base model's capabilities, as they primarily optimize based on existing knowledge rather than facilitating new information acquisition.

Method: ReLIFT alternates between RL training and supervised fine-tuning using high-quality solutions collected when the model encounters challenging questions, combining the strengths of both approaches.

Result: ReLIFT achieves an average improvement of +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to zero-RL models, while using only 13% of detailed demonstration data.

Conclusion: ReLIFT overcomes the fundamental limitations of RL in LLM reasoning and demonstrates significant potential for scalable reasoning capability enhancement through complementary RL and SFT training.

Abstract: Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model’s original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, \textbf{ReLIFT} (\textbf{Re}inforcement \textbf{L}earning \textbf{I}nterleaved with Online \textbf{F}ine-\textbf{T}uning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model’s reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscores the significant potential.

[612] ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, Takuya Akiba

Main category: cs.AI

TL;DR: ALE-Bench is a new benchmark for evaluating AI systems on score-based algorithmic programming contests using real optimization problems from AtCoder Heuristic Contests, focusing on long-horizon iterative refinement rather than pass/fail coding tests.

DetailsMotivation: To assess AI performance on hard optimization problems in practical domains like routing, scheduling, and planning, and to address the gap in existing benchmarks that don't support long-term iterative solution development.

Method: Created ALE-Bench using real tasks from AtCoder Heuristic Contests, featuring computationally hard optimization problems with no known exact solutions. Developed a software framework supporting interactive agent architectures with test-run feedback and visualizations.

Result: Evaluation of frontier LLMs showed they perform well on specific problems but lack consistency across different problems and struggle with long-horizon problem-solving compared to humans.

Conclusion: ALE-Bench is needed to drive future AI advancements by providing a proper benchmark for evaluating iterative, long-term optimization problem-solving capabilities where current AI systems still fall short of human performance.

Abstract: How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.

[613] Machine Learning as Iterated Belief Change a la Darwiche and Pearl

Theofanis Aravanis

Main category: cs.AI

TL;DR: This paper extends previous work on binary ANNs by modeling their training dynamics using robust AGM-style belief change operations (lexicographic revision and moderate contraction) instead of full-meet belief change, addressing limitations of the earlier approach.

DetailsMotivation: To address critical limitations in previous work that modeled binary ANN training using full-meet AGM belief change, which has known shortcomings, by developing a more effective framework using robust belief change operations.

Method: The authors use Dalal’s method for belief change to induce structured evolution of belief states, and demonstrate that binary ANN training dynamics can be better modeled using lexicographic revision and moderate contraction operations that align with the Darwiche-Pearl framework for iterated belief change.

Result: The study shows that robust AGM-style change operations provide a more effective model for binary ANN training dynamics compared to the previous full-meet belief change approach.

Conclusion: Binary ANN training can be effectively modeled using robust belief change operations (lexicographic revision and moderate contraction) within the Darwiche-Pearl framework, overcoming limitations of the previous full-meet belief change approach.

Abstract: Artificial Neural Networks (ANNs) are powerful machine-learning models capable of capturing intricate non-linear relationships. They are widely used nowadays across numerous scientific and engineering domains, driving advancements in both research and real-world applications. In our recent work, we focused on the statics and dynamics of a particular subclass of ANNs, which we refer to as binary ANNs. A binary ANN is a feed-forward network in which both inputs and outputs are restricted to binary values, making it particularly suitable for a variety of practical use cases. Our previous study approached binary ANNs through the lens of belief-change theory, specifically the Alchourron, Gardenfors and Makinson (AGM) framework, yielding several key insights. Most notably, we demonstrated that the knowledge embodied in a binary ANN (expressed through its input-output behaviour) can be symbolically represented using a propositional logic language. Moreover, the process of modifying a belief set (through revision or contraction) was mapped onto a gradual transition through a series of intermediate belief sets. Analogously, the training of binary ANNs was conceptualized as a sequence of such belief-set transitions, which we showed can be formalized using full-meet AGM-style belief change. In the present article, we extend this line of investigation by addressing some critical limitations of our previous study. Specifically, we show that Dalal’s method for belief change naturally induces a structured, gradual evolution of states of belief. More importantly, given the known shortcomings of full-meet belief change, we demonstrate that the training dynamics of binary ANNs can be more effectively modelled using robust AGM-style change operations – namely, lexicographic revision and moderate contraction – that align with the Darwiche-Pearl framework for iterated belief change.

[614] Think How to Think: Mitigating Overthinking with Autonomous Difficulty Cognition in Large Reasoning Models

Yongjiang Liu, Haoxi Li, Xiaosong Ma, Jie Zhang, Song Guo

Main category: cs.AI

TL;DR: TH2T is a two-stage fine-tuning strategy that helps Large Reasoning Models recognize task difficulty and reduce redundant reasoning, cutting inference costs by 70% on easy tasks and 40% on hard tasks while maintaining performance.

DetailsMotivation: LRMs suffer from overthinking, generating overly long reasoning paths due to inability to recognize task difficulty levels before solving problems, leading to one-size-fits-all reasoning.

Method: Two-stage fine-tuning: (1) Inject difficulty hypnosis into output prefixes to guide adaptive reasoning depth using hybrid dataset, (2) Incorporate redundancy hypnosis to supervise intermediate steps to eliminate unnecessary reasoning patterns.

Result: Reduces inference costs by over 70% on easy tasks and 40% on hard tasks while maintaining performance stability. Models show clear difficulty-aware capabilities and reduced redundancy.

Conclusion: TH2T effectively alleviates overthinking in LRMs by bootstrapping difficulty and redundancy cognition, enabling more efficient and adaptive reasoning without performance degradation.

Abstract: Recent Large Reasoning Models (LRMs) excel at complex reasoning tasks but often suffer from overthinking, generating overly long and redundant reasoning trajectories. To explore its essence, our empirical analysis reveals that LRMs are primarily limited to recognizing task properties (i.e., difficulty levels) like humans before solving the problem, leading to a one-size-fits-all reasoning process. Inspired by this, a pressing and natural question emerges: Can we explicitly bootstrap such ability to alleviate overthinking in LRMs? In this paper, we propose Think-How-to-Think (TH2T), a novel two-stage fine-tuning strategy that progressively inspires LRMs’ difficulty cognition and redundancy cognition of LRMs. Specifically, we first inject difficulty hypnosis into output prefixes to guide the model toward adaptive reasoning depth, trained on a hybrid dataset mixing short and long reasoning paths. Then, we incorporate redundancy hypnosis, which supervises the intermediate reasoning steps to identify and eliminate unnecessary reasoning patterns. Experiments on 7B/14B/32B models demonstrate that TH2T significantly reduces inference costs by over 70% on easy tasks and 40% on hard tasks while maintaining performance stability. The resulting outputs exhibit clear signs of difficulty-aware capabilities and reduced redundancy (e.g., reflection and looping).

[615] GTA1: GUI Test-time Scaling Agent

Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li

Main category: cs.AI

TL;DR: GTA1 is a GUI agent that addresses planning challenges in expansive action spaces and improves action grounding through test-time scaling with concurrent sampling and reinforcement learning.

DetailsMotivation: GUI agents face challenges in planning under expansive action spaces (many valid action sequences exist) and accurately grounding actions in complex, high-resolution interfaces.

Method: 1) Test-time scaling: samples multiple candidate action proposals, evaluates them with a judge model, and selects the best one. 2) Reinforcement learning-based grounding: improves visual element interaction through objective alignment that rewards successful clicks.

Result: GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks.

Conclusion: The proposed test-time scaling and RL-based grounding methods effectively address planning and action grounding challenges in GUI agents, leading to superior performance.

Abstract: Graphical user interface (GUI) agents autonomously complete tasks across platforms (\eg, Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (\ie, the action proposal sequence) under expansive action space, where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, \ie, precisely interacting with visual targets. This paper investigates the aforementioned challenges with our \textbf{G}UI \textbf{T}est-time Scaling \textbf{A}gent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled and evaluated and selected by a judge model. It trades off computation for better decision quality by concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks. The code and models are released here.

[616] FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka

Main category: cs.AI

TL;DR: FloorplanQA is a diagnostic benchmark for evaluating spatial reasoning in LLMs using structured indoor scene representations in JSON/XML formats.

DetailsMotivation: To identify limitations in LLMs' spatial reasoning capabilities, particularly their inability to respect physical constraints and maintain spatial coherence in indoor layouts.

Method: Created a benchmark with structured representations of indoor scenes (kitchens, living rooms, etc.) in JSON/XML formats, testing core spatial tasks like distance measurement, visibility, path finding, and object placement.

Result: LLMs succeed in shallow queries but fail to respect physical constraints and preserve spatial coherence, though they remain mostly robust to small spatial perturbations. The benchmark reveals inconsistent reasoning about indoor layouts.

Conclusion: FloorplanQA uncovers a blind spot in current LLMs regarding spatial reasoning and aims to inspire development of models that can better infer and manipulate spatial properties in practical settings.

Abstract: We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large-language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes, such as (e.g., kitchens, living rooms, bedrooms, bathrooms, and others), encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed in shallow queries, they often fail to respect physical constraints, preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today’s LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.

[617] Cross-Modal Distillation For Widely Differing Modalities

Cairong Zhao, Yufeng Jin, Zifan Song, Haonan Chen, Duoqian Miao, Guosheng Hu

Main category: cs.AI

TL;DR: Proposes a cross-modal distillation framework with soft constrained knowledge transfer strategies and quality-based adaptive weights to prevent overfitting when transferring knowledge between different modalities like image, text, and speech.

DetailsMotivation: To address the challenge of limited multi-modal data access by using teacher models to transfer discriminative knowledge to student models, while overcoming the overfitting problem caused by the big domain gap between different modalities.

Method: Introduces soft constrained knowledge distillation strategies at feature and classifier levels instead of hard constrained loss, and proposes a quality-based adaptive weights module to weigh input samples based on quantified data quality.

Result: Experiments on speaker recognition and image classification tasks show effective knowledge transfer between image, text, and speech modalities.

Conclusion: The proposed cross-modal distillation framework successfully enables knowledge transfer between widely differing modalities while preventing overfitting through soft constraints and quality-based weighting.

Abstract: Deep learning achieved great progress recently, however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To solve the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find hard constrained loss, e.g. l2 loss forcing the student being exact the same as the teacher, can easily lead to overfitting in cross-modality distillation. To address this, we propose two soft constrained knowledge distillation strategies at the feature level and classifier level respectively. In addition, we propose a quality-based adaptive weights module to weigh input samples via quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach is able to effectively achieve knowledge transfer between the commonly used and widely differing modalities of image, text, and speech.

[618] QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry

Jiaqing Xie, Weida Wang, Ben Gao, Zhuo Yang, Haiyuan Wan, Shufei Zhang, Tianfan Fu, Yuqiang Li

Main category: cs.AI

TL;DR: QCBench is a benchmark of 350 quantitative chemistry problems across 7 subfields to evaluate LLMs’ mathematical reasoning in chemistry, showing performance degrades with complexity.

DetailsMotivation: To address the underexplored ability of LLMs to perform rigorous quantitative chemistry calculations and provide systematic evaluation of their mathematical reasoning in chemical contexts.

Method: Created QCBench with 350 computational chemistry problems across 7 subfields, categorized into easy/medium/difficult tiers, using realistic chemical scenarios that require explicit numerical reasoning.

Result: Evaluation of 24 LLMs showed consistent performance degradation with increasing task complexity, revealing gaps between language fluency and scientific computation accuracy.

Conclusion: QCBench enables fine-grained diagnosis of computational weaknesses and lays groundwork for future improvements like domain-adaptive fine-tuning or multi-modal integration in quantitative chemistry.

Abstract: Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this blank, we propose QCBench, a Quantitative Chemistry oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields, which contains analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry and quantum chemistry. To systematically evaluate the mathematical reasoning abilities of large language models (LLMs), they are categorized into three tiers: easy, medium, and difficult. Each problem, rooted in realistic chemical scenarios, is structured to prevent heuristic shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 24 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy. Code for QCBench is available at https://github.com/jiaqingxie/QCBench.

[619] Towards Generalizable Context-aware Anomaly Detection: A Large-scale Benchmark in Cloud Environments

Xinkai Zou, Xuan Jiang, Ruikai Huang, Haoze He, Parv Kapoor, Hongrui Wu, Yibo Wang, Jian Sha, Xiongbo Shi, Zixun Huang, Jinhua Zhao

Main category: cs.AI

TL;DR: CloudAnoBench is a large-scale benchmark for context anomalies in cloud environments that combines metrics and logs, with 28 anomalous and 16 deceptive normal scenarios. CloudAnoAgent, an LLM-based agent with symbolic verification, achieves substantial improvements in detection and shows strong generalization.

DetailsMotivation: Existing benchmarks focus on single modalities (metrics or logs) and lack reliable annotation, while detection methods overlook contextual signals and have limited real-world applicability. Constructing cross-modal benchmarks is challenging due to infeasibility of reproducing real anomalies and maintaining cross-modal consistency in synthetic data.

Method: Created CloudAnoBench with 28 anomalous scenarios and 16 deceptive normal scenarios, containing 1,252 labeled cases and ~200,000 log and metric entries. Proposed CloudAnoAgent, an LLM-based agent enhanced by symbolic verification that integrates metrics and logs for context-aware anomaly detection.

Result: CloudAnoBench exhibits higher ambiguity and greater difficulty than prior benchmarks, where both traditional ML methods and vanilla LLM prompting perform poorly. CloudAnoAgent achieves substantial improvements in both anomaly detection and scenario identification on CloudAnoBench, and shows strong generalization to existing datasets.

Conclusion: CloudAnoBench and CloudAnoAgent lay the groundwork for advancing context-aware anomaly detection in cloud systems by addressing the limitations of single-modality approaches and providing a comprehensive benchmark with effective detection methods.

Abstract: Anomaly detection in cloud environments remains both critical and challenging. Existing context-level benchmarks typically focus on either metrics or logs and often lack reliable annotation, while most detection methods emphasize point anomalies within a single modality, overlooking contextual signals and limiting real-world applicability. Constructing a benchmark for context anomalies that combines metrics and logs is inherently difficult: reproducing anomalous scenarios on real servers is often infeasible or potentially harmful, while generating synthetic data introduces the additional challenge of maintaining cross-modal consistency. We introduce CloudAnoBench, a large-scale benchmark for context anomalies in cloud environments, comprising 28 anomalous scenarios and 16 deceptive normal scenarios, with 1,252 labeled cases and roughly 200,000 log and metric entries. Compared with prior benchmarks, CloudAnoBench exhibits higher ambiguity and greater difficulty, on which both prior machine learning methods and vanilla LLM prompting perform poorly. To demonstrate its utility, we further propose CloudAnoAgent, an LLM-based agent enhanced by symbolic verification that integrates metrics and logs. This agent system achieves substantial improvements in both anomaly detection and scenario identification on CloudAnoBench, and shows strong generalization to existing datasets. Together, CloudAnoBench and CloudAnoAgent lay the groundwork for advancing context-aware anomaly detection in cloud systems. Project Page: https://jayzou3773.github.io/cloudanobench-agent/

[620] Trainable Dynamic Mask Sparse Attention

Jingze Shi, Yifan Wu, Yiran Peng, Bingheng Wu, Liangdong Wang, Guang Liu, Yuyu Luo

Main category: cs.AI

TL;DR: Dynamic Mask Attention is a trainable sparse attention mechanism that uses value vectors to generate content-aware sparse masks, enabling adaptive focus on crucial information while reducing computational complexity by up to 10x.

DetailsMotivation: Standard self-attention has quadratic complexity that limits long-context modeling in large language models, while existing sparse attention methods suffer from static patterns and information loss.

Method: Three key innovations: 1) Content-aware sparse masks generated dynamically from value vectors, 2) Position-aware sparse attention computation that skips unnecessary regions, 3) Gradient-preserving design that supports end-to-end training without obstructing gradients.

Result: Achieves Pareto dominance across various tasks including scaling laws, multi-query associative recall, general benchmarks, and needle-in-a-haystack tests, with up to 10 times acceleration while maintaining performance.

Conclusion: The method effectively balances model efficiency with long-context modeling through dual-sparsity design that retains complete information while significantly reducing computational complexity.

Abstract: In large language models, the demand for modeling long contexts is ever-increasing, yet the quadratic complexity of standard self-attention presents a significant bottleneck. While existing sparse attention mechanisms enhance efficiency, they often suffer from limitations such as static patterns and information loss. This paper introduces a Trainable Dynamic Mask Sparse Attention mechanism that addresses these challenges through three key innovations. First, it leverages value vectors to dynamically generate content-aware sparse masks, enabling the model to adaptively identify and focus on crucial information. Second, it implements a position-aware sparse attention computation that effectively skips unnecessary computational regions. Finally, we ensure that the introduced dynamic masks and sparse weights do not obstruct gradients, thereby supporting end-to-end training. This dual-sparsity design allows the model to retain complete information while significantly reducing computational complexity, achieving an excellent balance between efficiency and performance. We validate the performance of Dynamic Mask Attention through comprehensive experiments. Comparative studies demonstrate that our method consistently achieves Pareto dominance across various tasks, including scaling laws, multi-query associative recall, general benchmarks, and needle-in-a-haystack tests, delivering up to 10 times acceleration. These results highlight its capability to effectively balance model efficiency with long-context modeling. Our computational kernel is open-sourced at https://github.com/SmallDoges/flash-dmattn to facilitate further research and application within the community.

[621] OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Haotian Yao, Ziwei Chen, Qizheng Gu, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, Tao Yu

Main category: cs.AI

TL;DR: OpenCUA is an open-source framework for computer-use agents that includes annotation tools, a large-scale dataset (AgentNet), and scalable training pipelines, achieving state-of-the-art performance with 45.0% success rate on OSWorld-Verified benchmark.

DetailsMotivation: As commercial vision-language models become increasingly closed-source while mediating important digital interactions, there's a need for open frameworks to study computer-use agents' capabilities, limitations, and risks.

Method: The framework consists of: (1) annotation infrastructure for capturing human demonstrations, (2) AgentNet dataset spanning 3 OS and 200+ applications, (3) scalable pipeline converting demonstrations to state-action pairs with reflective Chain-of-Thought reasoning.

Result: OpenCUA-72B achieves 45.0% average success rate on OSWorld-Verified, establishing new SOTA among open-source models. The approach generalizes well across domains and benefits from increased test-time computation.

Conclusion: OpenCUA provides comprehensive open foundations for computer-use agent research, with released annotation tools, datasets, code, and models to enable further study of these increasingly important systems.

Abstract: Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models. Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.

[622] Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning

Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue

Main category: cs.AI

TL;DR: Symbolic-Aided Chain-of-Thought improves logical reasoning in LLMs by integrating symbolic representations into prompts, enhancing transparency and performance on complex reasoning tasks.

DetailsMotivation: To enhance the transparency, interpretability, and analyzability of LLM logical reasoning while preserving generalizability of standard prompting techniques.

Method: Integrates lightweight symbolic representations into few-shot prompts to structure inference steps with consistent strategy, making reasoning patterns more explicit in non-interactive reasoning processes.

Result: Significantly outperforms conventional CoT on three out of four datasets (ProofWriter, ProntoQA, LogicalDeduction), consistently improves reasoning capabilities across various model sizes, particularly effective in complex reasoning tasks requiring multiple constraints.

Conclusion: Symbolic-Aided CoT is an effective approach that enhances LLM logical reasoning by incorporating symbolic structures, demonstrating superior performance over standard CoT methods on multiple benchmarks.

Abstract: This work introduces Symbolic-Aided Chain-of-Thought (CoT), an improved approach to standard CoT, for logical reasoning in large language models (LLMs). The key idea is to integrate lightweight symbolic representations into few-shot prompts, structuring the inference steps with a consistent strategy to make reasoning patterns more explicit within a non-interactive reasoning process. By incorporating these symbolic structures, Symbolic-Aided CoT preserves the generalizability of standard prompting techniques while enhancing the transparency, interpretability, and analyzability of LLM logical reasoning. Extensive experiments on four well-known logical reasoning benchmarks – ProofWriter, FOLIO, ProntoQA, and LogicalDeduction, which cover diverse reasoning tasks and scenarios – demonstrate the effectiveness of the proposed approach, particularly in complex reasoning tasks that require navigating multiple constraints or rules. Notably, Symbolic-Aided CoT consistently improves LLMs’ reasoning capabilities across various model sizes and significantly outperforms conventional CoT on three out of four datasets, ProofWriter, ProntoQA, and LogicalDeduction.

[623] EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing

Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng

Main category: cs.AI

TL;DR: EvolMathEval is an automated framework that generates and evolves mathematical benchmarks using evolutionary testing to create challenging problems that better evaluate LLM capabilities.

DetailsMotivation: Existing mathematical reasoning benchmarks become easier over time as LLMs learn from published datasets, limiting accurate evaluation of state-of-the-art model capabilities.

Method: Uses evolutionary testing to automatically generate and evolve mathematical benchmarks through continuous self-iteration, enhancing complexity of existing datasets.

Result: Generated high-difficulty problems that reduced LLM accuracy by average 48%, and identified ‘Pseudo Aha Moment’ phenomenon where LLMs bypass complex reasoning using simplistic conditions, accounting for 77-100% of errors.

Conclusion: EvolMathEval effectively creates challenging mathematical benchmarks that reveal LLM limitations in complex reasoning and provides better evaluation of true model capabilities.

Abstract: The rapid advancement of Large Language Models (LLMs) poses a significant challenge to existing mathematical reasoning benchmarks. However, these benchmarks tend to become easier over time as LLMs can learn from the published benchmarks. This limitation hinder the precise evaluation of the true capabilities of SOTA models. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. Experimental results demonstrate that EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but it can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved problems, LLMs tend to bypass complex multi-step logical reasoning by relying on simplistic and fuzzy conditions, consequently leading to incorrect solutions. We define this phenomenon as the ``Pseudo Aha Moment", which we find accounts for 77% to 100% of errors on targeted problems. Code and resources are available at: https://anonymous.4open.science/r/EvolMathEval

[624] Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination

Terry Jingchen Zhang, Gopal Dev, Ning Wang, Nicole Ni, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin

Main category: cs.AI

TL;DR: This paper presents a framework to evaluate LLM capability using research-level QA synthesized from arXiv papers, finding no significant performance decay near knowledge cutoff dates, suggesting multi-step reasoning mitigates benchmark contamination.

DetailsMotivation: To address concerns about data contamination in LLM benchmarks and determine if static benchmarks measure genuine reasoning versus mere memorization.

Method: Used an infinitely scalable framework to synthesize multi-step reasoning questions from arXiv papers, leveraging temporal structure to test performance before/after knowledge cutoff dates across 4 frontier models.

Result: Consistently found no significant performance decay near knowledge cutoff dates across models of various sizes, developers, and release dates, contrary to previous studies.

Conclusion: Multi-step reasoning required by the synthesis pipeline provides complexity beyond shallow memorization, serving as an effective mitigation strategy against benchmark contamination, advocating for reasoning-driven benchmark construction.

Abstract: Capability evaluation of large language models (LLMs) is increasingly shadowed by rising concerns of data contamination that cast doubts on whether static benchmarks measure genuine reasoning or mere memorization. We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers, harnessing the natural temporal structure of research publications where performance decay after knowledge cutoffs may indicate potential contamination. We evaluated 4 frontier model represented by 2 models of different knowledge cutoff dates per family on 1,643 multi-step reasoning questions synthesized from 20,277 arXiv papers stratified over 26 months, covering at least 6 months before and after all cutoff dates. Our results consistently showed a lack of significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We further performed a comparative analysis with previous longitudinal studies that reported significant post-cutoff performance decay using directly retrieved questions based on public data. we hypothesize that the multi-step reasoning required by our synthesis pipeline offered additional complexity that goes deeper than shallow memorization, which effectively serves a mitigation strategy against benchmark contamination. We fully open source our code and dataset to aid reproducibility and advocate for a paradigm shift that prioritize reasoning-driven synthesis to construct benchmarks over simply collecting newly released questions periodically.

[625] ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory

Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, Lianhui Qin

Main category: cs.AI

TL;DR: The paper proposes concept-level memory for LLMs that persists reusable abstractions from reasoning traces, enabling test-time continual learning without weight updates and achieving 7.5% performance gain on ARC-AGI benchmark.

DetailsMotivation: Current LLMs discard valuable patterns and insights from long reasoning traces once the context window resets. External memory can persist these discoveries, but existing approaches use instance-based entries that lack reusability and scalability.

Method: Introduces concept-level memory with strategies for abstracting takeaways from solution rollouts and retrieving relevant concepts for new queries. Uses natural language to store modular abstractions that can be selectively integrated into prompts.

Result: Achieves 7.5% relative gain over strong no-memory baseline on ARC-AGI benchmark. Performance continues to scale with inference compute, with abstract concepts being the most consistent memory design. Dynamic memory updates during test-time outperform fixed settings.

Conclusion: Concept-level memory enables effective test-time continual learning through accumulation and abstraction of patterns, supporting self-improvement without weight updates. The approach shows promise for compositional generalization and abstract reasoning tasks.

Abstract: While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. We evaluate on ARC-AGI, a benchmark that stresses compositional generalization and abstract reasoning, making it a natural fit for concept memory. Our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, dynamically updating memory during test-time outperforms fixed settings, supporting the hypothesis that accumulating and abstracting patterns enables further solutions in a form of self-improvement. Code is available at https://github.com/matt-seb-ho/arc_memo.

[626] GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Mechanism

Hailong Yang, Renhuo Zhao, Guanjin Wang, Zhaohong Deng

Main category: cs.AI

TL;DR: GAMA is a privacy-preserving multi-agent system that divides agents into private and public spaces, using anonymization mechanisms to protect sensitive data while maintaining task performance through knowledge and logic enhancement modules.

DetailsMotivation: To enable secure utilization of cloud-based LLMs in multi-agent systems when handling private data, addressing the privacy risks of using public LLM services for sensitive information.

Method: Divides workspace into private/public spaces with structured anonymization. Uses Domain-Rule-based Knowledge Enhancement (DRKE) and Disproof-based Logic Enhancement (DLE) to mitigate semantic loss from anonymization.

Result: Outperforms existing baselines on general QA datasets, privacy leakage benchmarks, and customized privacy-related QA datasets in both task accuracy and privacy preservation metrics.

Conclusion: GAMA effectively balances privacy protection and task performance in LLM-based multi-agent systems through its structured anonymization approach and enhancement modules.

Abstract: With the rapid advancement of Large Language Models (LLMs), LLM-based agents exhibit exceptional abilities in understanding and generating natural language, enabling human-like collaboration and information transmission in LLM-based Multi-Agent Systems (MAS). High-performance LLMs are often hosted on web servers in public cloud environments. When tasks involve private data, MAS cannot securely utilize these LLMs without implementing the agentic privacy-preserving mechanism. To address this challenge, we propose a General Anonymizing Multi-Agent System (GAMA), which divides the agents’ workspace into private and public spaces, ensuring privacy through a structured anonymization mechanism. In the private space, agents handle sensitive data, while in the public web space, only anonymized data is utilized. GAMA incorporates two key modules to mitigate semantic loss caused by anonymization: Domain-Rule-based Knowledge Enhancement (DRKE) and Disproof-based Logic Enhancement (DLE). We evaluate GAMA on two general question-answering datasets, a public privacy leakage benchmark, and two customized question-answering datasets related to privacy. The results demonstrate that GAMA outperforms existing baselines on the evaluated datasets in terms of both task accuracy and privacy preservation metrics.

[627] Autonomous Data Agents: A New Opportunity for Smart Data

Yanjie Fu, Dongjie Wang, Wangyang Ying, Xinyuan Wang, Xiangliang Zhang, Huan Liu, Jian Pei

Main category: cs.AI

TL;DR: DataAgents use LLM reasoning to autonomously handle diverse data tasks through task decomposition, action reasoning, and tool calling, transforming complex data into actionable knowledge.

DetailsMotivation: Data preparation and analysis remain labor-intensive and difficult to scale, while data is often not optimally structured for AI utilization. The alignment between AI and data is essential for effective knowledge extraction.

Method: DataAgents integrate LLM reasoning with task decomposition, action reasoning and grounding, and tool calling. They dynamically plan workflows, call powerful tools, and adapt to diverse data tasks at scale.

Result: DataAgents can handle collection, integration, preprocessing, selection, transformation, reweighing, augmentation, reprogramming, repairs, and retrieval, transforming complex unstructured data into coherent actionable knowledge.

Conclusion: DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. The report calls for advancing workflow optimization, establishing benchmarks, safeguarding privacy, balancing efficiency with scalability, and developing trustworthy guardrails.

Abstract: As data continues to grow in scale and complexity, preparing, transforming, and analyzing it remains labor-intensive, repetitive, and difficult to scale. Since data contains knowledge and AI learns knowledge from it, the alignment between AI and data is essential. However, data is often not structured in ways that are optimal for AI utilization. Moreover, an important question arises: how much knowledge can we pack into data through intensive data operations? Autonomous data agents (DataAgents), which integrate LLM reasoning with task decomposition, action reasoning and grounding, and tool calling, can autonomously interpret data task descriptions, decompose tasks into subtasks, reason over actions, ground actions into python code or tool calling, and execute operations. Unlike traditional data management and engineering tools, DataAgents dynamically plan workflows, call powerful tools, and adapt to diverse data tasks at scale. This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents are capable of handling collection, integration, preprocessing, selection, transformation, reweighing, augmentation, reprogramming, repairs, and retrieval. Through these capabilities, DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend. We then define the concept of DataAgents and discuss their architectural design, training strategies, as well as the new skills and capabilities they enable. Finally, we call for concerted efforts to advance action workflow optimization, establish open datasets and benchmark ecosystems, safeguard privacy, balance efficiency with scalability, and develop trustworthy DataAgent guardrails to prevent malicious actions.

[628] Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking

Jinyi Han, Ying Huang, Ying Liao, Zishang Jiang, Xikun Lu, Haiquan Zhao, Xinyi Wang, Guanghao Zhou, Sihang Jiang, Jiaqing Liang, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao

Main category: cs.AI

TL;DR: JET trains Large Reasoning Models to proactively terminate unnecessary reasoning steps, improving efficiency without sacrificing accuracy through trajectory truncation and quality-controlled length rewards.

DetailsMotivation: Large Reasoning Models incur substantial computational costs due to deep reasoning, and existing reinforcement learning methods struggle to construct short reasoning paths during rollout.

Method: JET performs trajectory truncation during rollout to expose models to short reasoning paths and uses a quality-controlled length reward to encourage concise reasoning while maintaining correctness.

Result: JET significantly improves reasoning efficiency without sacrificing accuracy, with DeepSeek-Distill-Qwen-1.5B achieving 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark.

Conclusion: JET enables efficient reasoning by training models to terminate unnecessary reasoning steps early, demonstrating substantial improvements in both efficiency and accuracy.

Abstract: Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. To achieve efficient reasoning, existing reinforcement learning methods still struggle to construct short reasoning path during the rollout stage, limiting effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. Besides, it uses a quality-controlled length reward to better encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. Especially, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available in the GitHub.

[629] Quant Fever, Reasoning Blackholes, Schrodinger’s Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan

Main category: cs.AI

TL;DR: Security evaluation of GPT-OSS-20B reveals multiple failure modes including quant fever, reasoning blackholes, and other vulnerabilities that can be exploited through adversarial attacks.

DetailsMotivation: To systematically evaluate the security vulnerabilities of OpenAI's GPT-OSS-20B model under adversarial conditions using the Jailbreak Oracle tool.

Method: Used the Jailbreak Oracle (JO) systematic LLM evaluation tool to probe GPT-OSS-20B’s behavior under different adversarial conditions, identifying specific failure modes.

Result: Uncovered several critical failure modes: quant fever, reasoning blackholes, Schrodinger’s compliance, reasoning procedure mirage, and chain-oriented prompting that can be exploited on the model.

Conclusion: The study demonstrates severe security vulnerabilities in GPT-OSS-20B that can lead to significant consequences when exploited by adversaries.

Abstract: OpenAI’s GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model’s behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes including quant fever, reasoning blackholes, Schrodinger’s compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on the GPT-OSS-20B model, leading to severe consequences.

[630] On the Self-awareness of Large Reasoning Models’ Capability Boundaries

Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, Han Qiu

Main category: cs.AI

TL;DR: LRMs can detect their capability boundaries through reasoning confidence patterns and hidden states, enabling optimization strategies that avoid unproductive reasoning while maintaining accuracy.

DetailsMotivation: Current LRMs waste computation on unsolvable problems by reasoning unproductively until context limits, highlighting the need for self-awareness of capability boundaries.

Method: Two monitoring approaches: 1) Reasoning expression monitoring for black-box models (analyzing confidence trajectories), 2) Hidden states monitoring for white-box models (analyzing separability of solvable/unsolvable problems in hidden representations).

Result: Boundary-aware strategies reduce token usage by 62.7-93.6% while maintaining accuracy, significantly improving reliability and efficiency.

Conclusion: LRMs possess inherent capability boundary awareness that can be leveraged through simple monitoring strategies to prevent unproductive reasoning and enhance computational efficiency.

Abstract: Large Reasoning Models (LRMs) have shown impressive performance on complex reasoning tasks such as mathematics, yet they also display misbehaviors that expose their limitations. In particular, when faced with hard questions, LRMs often engage in unproductive reasoning until context limit, producing wrong answers while wasting substantial computation. This phenomenon reflects a fundamental issue: current answering paradigms overlook the relationship between questions and LRMs’ capability boundaries. In this paper, we investigate whether LRMs possess self-awareness of capability boundaries. We begin by an observation that LRMs may know what they cannot solve through expressed reasoning confidence. For black-box models, we find that reasoning expressions reveal boundary signals, with accelerated growing confidence trajectory for solvable problems but convergent uncertainty trajectory for unsolvable ones. For white-box models, we show that hidden states of the last input token encode boundary information, with solvable and unsolvable problems linearly separable even before reasoning begins. Building on these findings, we propose two simple yet effective optimization strategies: reasoning expression monitoring and hidden states monitoring. Experiments demonstrate that these boundary-aware strategies enable LRMs to avoid unproductive reasoning without sacrificing accuracy, significantly improving reliability and efficiency by cutting token usage up to 62.7 - 93.6%.

[631] Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity

Zhen Bi, Zhenlin Hu, Jinnan Yang, Mingyang Chen, Cheng Deng, Yida Xue, Zeyu Yang, Qing Shen, Zhenfang Liu, Kang Zhao, Ningyu Zhang, Jungang Lou

Main category: cs.AI

TL;DR: The paper introduces Data Reasoning Intensity (DRI) to measure logical reasoning complexity in training data and proposes a re-cognizing optimization strategy to enhance LLM reasoning performance by aligning data complexity with model capacity.

DetailsMotivation: Current LLM training approaches focus on data format transformation but neglect internal reasoning complexity, leaving reasoning potential underutilized. The authors believe LLM reasoning performance is constrained by both training data potential and model cognitive capacity.

Method: Introduces Data Reasoning Intensity (DRI) metric to quantify logical reasoning complexity by decomposing and aggregating logical structures. Proposes a re-cognizing optimization strategy that systematically enhances logical reasoning intensity of training data to better align with LLM’s reasoning boundary.

Result: Extensive experiments show significant improvements in performance and generalization over data-centric strategies. Method validated under reinforcement learning framework demonstrates enhanced reasoning capabilities.

Conclusion: Prioritizing reasoning complexity in data rather than sheer scale or superficial form is essential to realizing LLMs’ full cognitive potential.

Abstract: Recent advances in large language models (LLMs) highlight the importance of training data structure and quality in shaping reasoning behavior. However, most existing approaches focus on transforming data formats while neglecting the internal reasoning complexity of training samples, leaving the reasoning potential of data under-explored and underutilized. In this work, we posit that LLM logical reasoning performance is jointly constrained by the potential of the training data and the cognitive capacity of the model. To make this relationship measurable, we introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples by decomposing and aggregating their logical structures. This allows us to analyze how well current LLMs utilize logical reasoning signals and identify performance gaps relative to data potential. Based on this insight, we introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data. Rather than increasing data volume, our method re-optimizes existing samples to better align with the LLM’s logical reasoning boundary. Extensive experiments show that our approach significantly improves performance and generalization over data-centric strategies. We further validate our method under a reinforcement learning framework. Our results indicate that prioritizing reasoning complexity in data rather than sheer scale or superficial form is essential to realizing LLMs’ full cognitive potential.

[632] Structural Reward Model: Enhancing Interpretability, Efficiency, and Scalability in Reward Modeling

Xiaoyu Liu, Di Liang, Chang Dai, Hongyu Shan, Peiyang Liu, Yonghao Liu, Muling Wu, Yuntao Li, Xianjie Wu, LI Miao, Jiangrong Shen, Minlong Peng

Main category: cs.AI

TL;DR: The paper proposes Structural Reward Models (SRMs) as a modular and interpretable alternative to traditional scalar RMs and generative RMs, addressing limitations in contextual evaluation and efficiency for industrial applications.

DetailsMotivation: Traditional scalar RMs struggle with contextual information and generative RMs suffer from black-box nature and inefficiency. Industrial scenarios need structured feedback for dimension-specific diagnostics and optimization.

Method: SRM uses a modular framework with side-branch models as auxiliary feature generators, introducing fine-grained dimensions for interpretable evaluation.

Result: SRMs outperform scalar RMs and GRMs in robustness and alignment with human preferences, supporting efficient optimization for practical scenarios.

Conclusion: SRM provides a practical, adaptable, and scalable reward modeling solution suitable for industrial deployment in applications like search and recommendation systems.

Abstract: Reward Models (RMs) are key components for evaluating and guiding language model outputs. However, traditional scalar RMs often struggle with incorporating contextual and background information during inference, leading to incomplete evaluations. Generative RMs (GRMs) attempt to address these limitations by generating intermediate reasoning steps. Yet, their uncontrolled black-box nature and inefficiency due to sequential decoding hinder their industrial deployment. Industrial scenarios, such as search and recommendation systems, often involve single-domain tasks requiring evaluation along specific dimensions. In such contexts, diagnosing “bad cases” necessitates structured feedback to identify and optimize dimension-specific issues. In this paper, we propose the Structural Reward Model (SRM), a modular and interpretable framework integrating side-branch models as auxiliary feature generators. By introducing fine-grained dimensions, SRMs enable interpretable and efficient evaluation, facilitating targeted diagnostics and optimization. This structured approach ensures adaptability and scalability for industrial applications. Through comprehensive experiments, we demonstrate that SRMs outperform scalar RMs and GRMs in robustness and alignment with human preferences. The modular design further supports efficient optimization for practical scenarios, allowing SRM to provide a practical reward modeling solution for industry.

[633] From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, Zhe Qian, Xinyu Liu, Yiwei Zhang, Junhao Wang, Hengbo Xu, Fei Luo, Xiaohua Chen, Xiaoshuai Hao, Hehan Li, Andi Zhang, Wenxuan Wang, Lingling Li, Zhiwu Lu, Yang Lu, Yike Guo

Main category: cs.AI

TL;DR: This survey introduces a “From Perception to Cognition” framework to analyze MLLMs’ disconnect between visual perception and reasoning, addressing hallucination and proposing future directions.

DetailsMotivation: MLLMs exhibit shallow integration between perception and cognition, leading to reasoning failures like hallucination, revealing the need for coherent internal world models.

Method: Proposes a unified analytical framework dividing vision-language understanding into Perception (visual information extraction) and Cognition (multi-step reasoning with observe-think-verify loop).

Result: Systematically analyzes current MLLM bottlenecks, surveys enhancement techniques from visual representations to reasoning paradigms, and reviews benchmarks.

Conclusion: Provides structured perspective on MLLM limitations and illuminates path toward next-generation models with deep reasoning and genuine world understanding.

Abstract: Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: ``From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.

[634] Transformer Classification of Breast Lesions: The BreastDCEDL_AMBL Benchmark Dataset and 0.92 AUC Baseline

Naomi Fridman, Anat Goldstein

Main category: cs.AI

TL;DR: Transformer-based framework for automated breast lesion classification in DCE-MRI achieves 0.92 AUC, potentially eliminating 33% of unnecessary biopsies while maintaining 100% sensitivity.

DetailsMotivation: Breast MRI has poor specificity leading to high false-positive rates and unnecessary biopsies, creating need for better automated classification methods.

Method: Implemented SegFormer architecture with semantic segmentation to quantify malignant pixel distribution, trained on curated BreastDCEDL_AMBL dataset with 88 patients and 133 annotated lesions.

Result: Achieved 0.92 AUC for lesion-level classification, 100% sensitivity and 67% specificity at patient level, potentially eliminating one-third of unnecessary biopsies without missing malignancies.

Conclusion: Framework provides interpretable spatial predictions and establishes first standardized benchmark for DCE-MRI lesion classification, enabling clinical deployment advancement.

Abstract: Breast magnetic resonance imaging is a critical tool for cancer detection and treatment planning, but its clinical utility is hindered by poor specificity, leading to high false-positive rates and unnecessary biopsies. This study introduces a transformer-based framework for automated classification of breast lesions in dynamic contrast-enhanced MRI, addressing the challenge of distinguishing benign from malignant findings. We implemented a SegFormer architecture that achieved an AUC of 0.92 for lesion-level classification, with 100% sensitivity and 67% specificity at the patient level - potentially eliminating one-third of unnecessary biopsies without missing malignancies. The model quantifies malignant pixel distribution via semantic segmentation, producing interpretable spatial predictions that support clinical decision-making. To establish reproducible benchmarks, we curated BreastDCEDL_AMBL by transforming The Cancer Imaging Archive’s AMBL collection into a standardized deep learning dataset with 88 patients and 133 annotated lesions (89 benign, 44 malignant). This resource addresses a key infrastructure gap, as existing public datasets lack benign lesion annotations, limiting benign-malignant classification research. Training incorporated an expanded cohort of over 1,200 patients through integration with BreastDCEDL datasets, validating transfer learning approaches despite primary tumor-only annotations. Public release of the dataset, models, and evaluation protocols provides the first standardized benchmark for DCE-MRI lesion classification, enabling methodological advancement toward clinical deployment.

[635] Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He

Main category: cs.AI

TL;DR: TRACE detects implicit reward hacking in reasoning models by measuring how early in the chain-of-thought the model achieves high reward, identifying shortcuts that bypass traditional monitors.

DetailsMotivation: Reward hacking poses a significant threat where models exploit loopholes in reward functions without solving the intended task, and implicit hacking bypasses chain-of-thought monitors by appearing benign.

Method: TRACE progressively truncates a model’s chain-of-thought at various lengths, forces the model to answer, and estimates expected reward at each cutoff. It quantifies effort by measuring when reasoning becomes sufficient to obtain reward.

Result: TRACE achieves over 65% gains over a 72B CoT monitor in math reasoning and over 30% gains over a 32B monitor in coding. It also discovers unknown loopholes during training.

Conclusion: TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective against implicit reward hacking.

Abstract: Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model’s chain-of-thought (CoT), or implicit, where the CoT appears benign thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less ’effort’ than required to achieve high reward. TRACE quantifies effort by measuring how early a model’s reasoning becomes sufficient to obtain the reward. We progressively truncate a model’s CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.

[636] Agentic Additive Manufacturing Alloy Discovery

Peter Pak, Achuth Chandrasekhar, Amir Barati Farimani

Main category: cs.AI

TL;DR: LLM-enabled multi-agent system automates alloy discovery in additive manufacturing by using tools like Thermo-Calc and process map generation to analyze printability and make autonomous decisions.

DetailsMotivation: Alloy discovery in additive manufacturing is complex and requires expertise across multiple domains. Agentic systems can augment researchers' abilities and accelerate this process.

Method: Developed a multi-agent system using LLMs that dispatch tool calls via Model Context Protocol (MCP) to perform thermodynamic simulations (Thermo-Calc) and generate process maps, with dynamic task adjustment based on results.

Result: The system can effectively reason through complex user prompts, provide analysis on alloy printability, and make autonomous decisions by dynamically adjusting task trajectories.

Conclusion: LLM-enabled multi-agent systems can automate and accelerate alloy discovery in additive manufacturing, demonstrating the benefits of adopting such systems for complex materials science challenges.

Abstract: Agentic systems enable the intelligent use of research tooling, augmenting a researcher’s ability to investigate and propose novel solutions to existing problems. Within Additive Manufacturing (AM), alloy discovery remains a complex challenge, often requiring expertise in the various domains of materials science, thermodynamic simulations, and experimental analysis. Large Language Model (LLM) enabled agents can facilitate this endeavor by utilizing their extensive knowledge base to dispatch tool calls via Model Context Protocol (MCP) to perform actions such as Thermo-Calc property diagram calculations and lack of fusion process map generation. In addition, the multi-agent system developed in this work is able to effectively reason through complex user prompts and provide analysis on the printability of proposed alloys. These agents can dynamically adjust their task trajectory to the outcomes of tool call results, effectively enabling autonomous decision-making in practical environments. This work aims to utilize LLM enabled agents to automate and accelerate the task of alloy discovery within the field of additive manufacturing and showcase the benefits of adopting this multi-agent system.

[637] Mitigating Modal Imbalance in Multimodal Reasoning

Chen Henry Wu, Neil Kale, Aditi Raghunathan

Main category: cs.AI

TL;DR: Foundation models struggle with cross-modal reasoning, showing poor performance (as low as 3%) when conflicting evidence is split across modalities due to attention imbalance, but simple training modifications can significantly improve this.

DetailsMotivation: To understand how well foundation models perform joint reasoning across multiple modalities, especially when modalities interact and form cross-modal context, by studying cross-modal conflicts where conflicting evidence appears across different modalities.

Method: Study cross-modal conflicts where conflicting evidence is presented across modalities, analyze cross-modal attention imbalance, and test a simple method of explicitly combining multiple modalities within each training instance.

Result: FMs recognize conflicts in unimodal contexts 90% of the time but fall to as low as 3% when evidence is split across modalities. Cross-modal attention imbalance causes this failure, with FMs disproportionately prioritizing certain modalities. The proposed training method significantly reduces attention imbalance and improves downstream performance on vision-language benchmarks.

Conclusion: Cross-modal attention imbalance is a critical issue that doesn’t resolve with simple dataset scaling, and systematic approaches to address cross-modal contexts are essential for building reliable foundation models.

Abstract: Foundation models (FMs) deployed in real-world tasks such as computer-use agents must integrate diverse modalities. How good are FMs at performing joint reasoning, simultaneously reasoning over multiple modalities, especially when the modalities interact and relate to each other to form cross-modal context? To better understand this problem, we study FMs on cross-modal conflicts: scenarios where conflicting evidence is presented across modalities. This allows us to examine whether FMs prioritize one modality over another or reason jointly to reconcile the conflict. Our experiments reveal that FMs can recognize conflicts in unimodal contexts, composed of a single modality, 90% of the time, but the ratio falls as low as 3% when evidence is split across modalities – similar observations hold in cross-lingual contexts, composed of multiple languages. We trace this failure to cross-modal attention imbalance, showing that FMs exhibit extreme asymmetry in attention scores, disproportionately prioritizing certain modalities. We show that cross-modal attention imbalance does not go away by simply scaling up multimodal or multilingual datasets blindly, since they lack training examples that explicitly require cross-modal reasoning. We demonstrate that even a simple and scalable method of explicitly combining multiple modalities within each training instance significantly reduces attention imbalance. Reduced attention imbalance directly translates to improved downstream performance on several vision-language benchmarks. Our findings underscore the importance of systematically addressing cross-modal contexts to build reliable foundation models.

cs.SD

[638] Linguistic and Audio Embedding-Based Machine Learning for Alzheimer’s Dementia and Mild Cognitive Impairment Detection: Insights from the PROCESS Challenge

Adharsha Sam Edwin Sam Devahi, Sohail Singh Sangha, Prachee Priyadarshinee, Jithin Thilakan, Ivan Fu Xing Tan, Christopher Johann Clarke, Sou Ka Lon, Balamurali B T, Yow Wei Quin, Chen Jer-Ming

Main category: cs.SD

TL;DR: A machine learning framework using both acoustic (Whisper embeddings) and linguistic features from speech to detect Alzheimer’s Dementia and Mild Cognitive Impairment, achieving competitive results in the PROCESS Challenge.

DetailsMotivation: Early detection of Alzheimer's Dementia and Mild Cognitive Impairment is critical but current methods are resource-intensive and invasive. Speech offers a promising non-invasive biomarker for cognitive decline.

Method: Used Whisper embeddings for audio features and extracted linguistic features (pronoun usage, syntactic complexity, filler words, clause structure) from transcriptions of Semantic Fluency, Phonemic Fluency, and Cookie Theft picture description tasks. Built classification models for HC/MCI/AD distinction and regression models for MMSE score prediction.

Result: Voted ensemble models with linguistic features achieved best classification (F1 = 0.497). Whisper embedding-based ensemble regressors yielded lowest MMSE prediction error (RMSE = 2.843). Models placed among top submissions in regression and mid-range for classification in PROCESS Challenge.

Conclusion: Multimodal speech-based approaches show promise for scalable, non-invasive cognitive assessment. Integration of task-specific linguistic and acoustic markers is important for dementia detection.

Abstract: Early detection of Alzheimer’s Dementia (AD) and Mild Cognitive Impairment (MCI) is critical for timely intervention, yet current diagnostic approaches remain resource-intensive and invasive. Speech, encompassing both acoustic and linguistic dimensions, offers a promising non-invasive biomarker for cognitive decline. In this study, we present a machine learning framework for the PROCESS Challenge, leveraging both audio embeddings and linguistic features derived from spontaneous speech recordings. Audio representations were extracted using Whisper embeddings from the Cookie Theft description task, while linguistic features-spanning pronoun usage, syntactic complexity, filler words, and clause structure-were obtained from transcriptions across Semantic Fluency, Phonemic Fluency, and Cookie Theft picture description. Classification models aimed to distinguish between Healthy Controls (HC), MCI, and AD participants, while regression models predicted Mini-Mental State Examination (MMSE) scores. Results demonstrated that voted ensemble models trained on concatenated linguistic features achieved the best classification performance (F1 = 0.497), while Whisper embedding-based ensemble regressors yielded the lowest MMSE prediction error (RMSE = 2.843). Comparative evaluation within the PROCESS Challenge placed our models among the top submissions in regression task, and mid-range for classification, highlighting the complementary strengths of linguistic and audio embeddings. These findings reinforce the potential of multimodal speech-based approaches for scalable, non-invasive cognitive assessment and underline the importance of integrating task-specific linguistic and acoustic markers in dementia detection.

[639] Audio Forensics Evaluation (SAFE) Challenge

Kirill Trapeznikov, Paul Cummer, Pranay Pherwani, Jai Aslam, Michael S. Davinroy, Peter Bautista, Laura Cassani, Matthew Stamm, Jill Crisman

Main category: cs.SD

TL;DR: The SAFE Challenge is a blind evaluation framework for benchmarking synthetic audio detection models across three progressively difficult scenarios: raw synthetic speech, processed audio, and laundered audio designed to evade forensic analysis.

DetailsMotivation: The increasing realism of synthetic speech from advanced TTS models, combined with post-processing and laundering techniques, poses significant challenges for audio forensic detection systems.

Method: Created a comprehensive blind evaluation framework with 90 hours of audio (21,000 samples) from 21 real sources and 17 TTS models, testing across three tasks of increasing difficulty.

Result: The challenge provides initial insights into the strengths and limitations of current synthetic audio detection approaches through systematic benchmarking.

Conclusion: SAFE offers a foundational framework for advancing synthetic audio detection research by establishing standardized evaluation protocols for increasingly realistic synthetic speech threats.

Abstract: The increasing realism of synthetic speech generated by advanced text-to-speech (TTS) models, coupled with post-processing and laundering techniques, presents a significant challenge for audio forensic detection. In this paper, we introduce the SAFE (Synthetic Audio Forensics Evaluation) Challenge, a fully blind evaluation framework designed to benchmark detection models across progressively harder scenarios: raw synthetic speech, processed audio (e.g., compression, resampling), and laundered audio intended to evade forensic analysis. The SAFE challenge consisted of a total of 90 hours of audio and 21,000 audio samples split across 21 different real sources and 17 different TTS models and 3 tasks. We present the challenge, evaluation design and tasks, dataset details, and initial insights into the strengths and limitations of current approaches, offering a foundation for advancing synthetic audio detection research. More information is available at \href{https://stresearch.github.io/SAFE/}{https://stresearch.github.io/SAFE/}.

[640] Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers

Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang

Main category: cs.SD

TL;DR: Siren is a novel LM-based framework that addresses limitations of RVQ tokenizers in text-to-audio generation by using multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning, achieving state-of-the-art performance.

DetailsMotivation: Language models with RVQ tokenizers lag behind diffusion-based models in text-to-audio generation due to a dilemma: more RVQ layers improve reconstruction fidelity but exceed LM capacity. Two key limitations were identified: orthogonality of features across RVQ layers and descending semantic richness in deeper layers.

Method: Proposed Siren framework using multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning to address RVQ limitations.

Result: Extensive experiments show Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results.

Conclusion: Siren bridges LM representational strengths with audio synthesis fidelity demands, repositioning LMs as competitive contenders against diffusion models in T2A tasks and enabling unified multi-modal generation frameworks.

Abstract: While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LMs training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi-modal generation frameworks.

[641] Lightweight and Generalizable Acoustic Scene Representations via Contrastive Fine-Tuning and Distillation

Kuang Yuan, Yang Gao, Xilin Li, Xinhao Mei, Syavosh Zadissa, Tarun Pruthi, Saeed Bagheri Sereshki

Main category: cs.SD

TL;DR: ContrastASC enables acoustic scene classification models to adapt to new categories without retraining by learning generalizable representations through supervised contrastive fine-tuning and distillation.

DetailsMotivation: Current ASC models on edge devices lack transferability to new acoustic categories, limiting real-world applicability where adaptation to refined or unseen scenes is needed.

Method: Combines supervised contrastive fine-tuning of pre-trained models with contrastive representation distillation to structure embedding space and transfer knowledge to compact student models.

Result: Shows improved few-shot adaptation to unseen categories while maintaining strong closed-set classification performance.

Conclusion: ContrastASC provides a practical solution for edge devices to handle evolving acoustic scene categories without requiring model retraining.

Abstract: Acoustic scene classification (ASC) models on edge devices typically operate under fixed class assumptions, lacking the transferability needed for real-world applications that require adaptation to new or refined acoustic categories. We propose ContrastASC, which learns generalizable acoustic scene representations by structuring the embedding space to preserve semantic relationships between scenes, enabling adaptation to unseen categories without retraining. Our approach combines supervised contrastive fine-tuning of pre-trained models with contrastive representation distillation to transfer this structured knowledge to compact student models. Our evaluation shows that ContrastASC demonstrates improved few-shot adaptation to unseen categories while maintaining strong closed-set performance.

[642] Soft Disentanglement in Frequency Bands for Neural Audio Codecs

Benoit Ginies, Xiaoyu Bie, Olivier Fercoq, Gaël Richard

Main category: cs.SD

TL;DR: A generalizable neural architecture for learning disentangled audio features using spectral decomposition and multi-branch audio codec.

DetailsMotivation: Existing disentanglement methods are often data/task-dependent, lacking generalizability for audio feature extraction.

Method: Spectral decomposition of time-domain signals followed by multi-branch audio codec operating on decomposed components.

Result: Better reconstruction and perceptual performance than state-of-the-art baseline, with advantages for inpainting tasks.

Conclusion: The proposed approach provides effective disentangled audio representations with improved performance and task versatility.

Abstract: In neural-based audio feature extraction, ensuring that representations capture disentangled information is crucial for model interpretability. However, existing disentanglement methods often rely on assumptions that are highly dependent on data characteristics or specific tasks. In this work, we introduce a generalizable approach for learning disentangled features within a neural architecture. Our method applies spectral decomposition to time-domain signals, followed by a multi-branch audio codec that operates on the decomposed components. Empirical evaluations demonstrate that our approach achieves better reconstruction and perceptual performance compared to a state-of-the-art baseline while also offering potential advantages for inpainting tasks.

[643] Désentrelacement Fréquentiel Doux pour les Codecs Audio Neuronaux

Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard

Main category: cs.SD

TL;DR: Proposes a disentangled neural audio codec using spectral decomposition to improve interpretability of learned representations, achieving better reconstruction and perceptual quality than state-of-the-art baselines.

DetailsMotivation: Address the challenge of interpretability in neural audio feature extraction, as current disentanglement techniques often depend on specific datasets or task formulations.

Method: Leverages spectral decomposition of time-domain signals to impose structure on extracted tokens in a disentangled neural audio codec.

Result: Experimental evaluations show the method surpasses state-of-the-art baseline in both reconstruction fidelity and perceptual quality.

Conclusion: The proposed spectral decomposition-based approach effectively enhances interpretability while maintaining high audio quality.

Abstract: While neural-based models have led to significant advancements in audio feature extraction, the interpretability of the learned representations remains a critical challenge. To address this, disentanglement techniques have been integrated into discrete neural audio codecs to impose structure on the extracted tokens. However, these approaches often exhibit strong dependencies on specific datasets or task formulations. In this work, we propose a disentangled neural audio codec that leverages spectral decomposition of time-domain signals to enhance representation interpretability. Experimental evaluations demonstrate that our method surpasses a state-of-the-art baseline in both reconstruction fidelity and perceptual quality.

[644] GDiffuSE: Diffusion-based speech enhancement with noise model guidance

Efrayim Yanir, David Burshtein, Sharon Gannot

Main category: cs.SD

TL;DR: GDiffuSE: A diffusion-based speech enhancement method that uses a lightweight helper model to estimate noise distribution, improving robustness to unseen noise types by leveraging pre-trained speech generation models.

DetailsMotivation: To create a more robust speech enhancement system that can handle unseen noise types, moving beyond conventional direct mapping approaches by incorporating diffusion models originally trained for speech generation.

Method: Uses a denoising diffusion probabilistic model (DDPM) with a guidance mechanism where a lightweight helper model estimates noise distribution to guide the diffusion denoising process.

Result: Consistent improvements over state-of-the-art baselines under mismatched noise conditions, showing better adaptation to unseen noise types.

Conclusion: The guided diffusion approach enables robust speech enhancement by effectively leveraging pre-trained speech generation models and adapting to various noise conditions through noise distribution estimation.

Abstract: This paper introduces a novel speech enhancement (SE) approach based on a denoising diffusion probabilistic model (DDPM), termed Guided diffusion for speech enhancement (GDiffuSE). In contrast to conventional methods that directly map noisy speech to clean speech, our method employs a lightweight helper model to estimate the noise distribution, which is then incorporated into the diffusion denoising process via a guidance mechanism. This design improves robustness by enabling seamless adaptation to unseen noise types and by leveraging large-scale DDPMs originally trained for speech generation in the context of SE. We evaluate our approach on noisy signals obtained by adding noise samples from the BBC sound effects database to LibriSpeech utterances, showing consistent improvements over state-of-the-art baselines under mismatched noise conditions. Examples are available at our project webpage.

[645] Machine Unlearning in Speech Emotion Recognition via Forget Set Alone

Zhao Ren, Rathi Adarshi Rammohan, Kevin Scheck, Tanja Schultz

Main category: cs.SD

TL;DR: A novel adversarial-attack-based approach for machine unlearning in speech emotion recognition that removes knowledge of forgotten data while maintaining model performance, using only the data to be forgotten.

DetailsMotivation: Speech emotion recognition handles sensitive data where users may request data deletion due to privacy concerns, but current unlearning methods require additional data and substantial computational resources.

Method: Proposes an adversarial-attack-based approach that fine-tunes a pre-trained speech emotion recognition model using only the data to be forgotten, without requiring other data.

Result: The approach effectively removes knowledge of forgotten data from the model while preserving high performance on emotion recognition test sets.

Conclusion: The proposed method provides an efficient solution for machine unlearning in speech emotion recognition that addresses privacy concerns without the computational burden of traditional approaches.

Abstract: Speech emotion recognition aims to identify emotional states from speech signals and has been widely applied in human-computer interaction, education, healthcare, and many other fields. However, since speech data contain rich sensitive information, partial data can be required to be deleted by speakers due to privacy concerns. Current machine unlearning approaches largely depend on data beyond the samples to be forgotten. However, this reliance poses challenges when data redistribution is restricted and demands substantial computational resources in the context of big data. We propose a novel adversarial-attack-based approach that fine-tunes a pre-trained speech emotion recognition model using only the data to be forgotten. The experimental results demonstrate that the proposed approach can effectively remove the knowledge of the data to be forgotten from the model, while preserving high model performance on the test set for emotion recognition.

[646] Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space

Christian Limberg, Fares Schulz, Zhe Zhang, Stefan Weinzierl

Main category: cs.SD

TL;DR: A two-stage semi-supervised learning framework for neural instrument sound synthesis that generates pitch-accurate, high-quality music samples from an expressive timbre latent space.

DetailsMotivation: Existing high-quality synthesis methods use high-dimensional latent representations that are difficult to navigate and provide poor user experience.

Method: Two-stage training: first train a pitch-timbre disentangled 2D representation using Variational Autoencoder, then use this as conditioning for a Transformer-based generative model.

Result: The method effectively learns a disentangled timbre space enabling expressive and controllable audio generation with reliable pitch conditioning, capturing subtle timbre variations while maintaining pitch accuracy.

Conclusion: The approach provides an intuitive interface for sound navigation and represents a step towards future music production environments that are both intuitive and creatively empowering.

Abstract: This paper presents a novel approach to neural instrument sound synthesis using a two-stage semi-supervised learning framework capable of generating pitch-accurate, high-quality music samples from an expressive timbre latent space. Existing approaches that achieve sufficient quality for music production often rely on high-dimensional latent representations that are difficult to navigate and provide unintuitive user experiences. We address this limitation through a two-stage training paradigm: first, we train a pitch-timbre disentangled 2D representation of audio samples using a Variational Autoencoder; second, we use this representation as conditioning input for a Transformer-based generative model. The learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the proposed method effectively learns a disentangled timbre space, enabling expressive and controllable audio generation with reliable pitch conditioning. Experimental results show the model’s ability to capture subtle variations in timbre while maintaining a high degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application, highlighting its potential as a step towards future music production environments that are both intuitive and creatively empowering: https://pgesam.faresschulz.com

[647] Evaluating Self-Supervised Speech Models via Text-Based LLMS

Takashi Maekaku, Keita Goto, Jinchuan Tian, Yusuke Shinohara, Shinji Watanabe

Main category: cs.SD

TL;DR: Proposes using LLMs to evaluate SSL models by computing mean log-likelihood on token sequences with domain cues, eliminating need for extra training or hyperparameter tuning.

DetailsMotivation: Current SSL evaluation methods require costly downstream task training and hyperparameter tuning, making assessment expensive and impractical.

Method: Use LLMs to compute mean log-likelihood on discrete token sequences from SSL models with minimal domain cues that guide in-context learning.

Result: LLM-based scores correlate with automatic speech recognition performance and provide useful embeddings for speaker verification tasks.

Conclusion: LLMs can effectively evaluate SSL models without extra training and also serve as inference-time embedding providers for downstream tasks.

Abstract: Self-Supervised Learning (SSL) has gained traction for its ability to learn rich representations with low labeling costs, applicable across diverse downstream tasks. However, assessing the downstream-task performance remains challenging due to the cost of extra training and evaluation. Existing methods for task-agnostic evaluation also require extra training or hyperparameter tuning. We propose a novel evaluation metric using large language models (LLMs). By inputting discrete token sequences and minimal domain cues derived from SSL models into LLMs, we obtain the mean log-likelihood; these cues guide in-context learning, rendering the score more reliable without extra training or hyperparameter tuning. Experimental results show a correlation between LLM-based scores and automatic speech recognition task. Additionally, our findings reveal that LLMs not only functions as an SSL evaluation tools but also provides inference-time embeddings that are useful for speaker verification task.

[648] A Study on the Data Distribution Gap in Music Emotion Recognition

Joann Ching, Gerhard Widmer

Main category: cs.SD

TL;DR: The paper addresses Music Emotion Recognition (MER) by investigating genre diversity and dataset biases across five datasets, proposing a simple framework combining Jukebox embeddings and chroma features for improved cross-dataset generalization.

DetailsMotivation: Prior MER studies focus on specific musical styles rather than diverse genres, and there's a need to address out-of-distribution generalization and dataset biases in emotion recognition from audio content.

Method: Systematic experiments across five datasets (EmoMusic, DEAM, PMEmo, WTC, WCMED) using multiple data and feature sets, combining Jukebox model embeddings with chroma features, and training with diverse datasets.

Result: The proposed framework demonstrates substantially improved cross-dataset generalization capabilities by addressing genre dominance and dataset biases in feature representations.

Conclusion: A simple yet effective framework combining Jukebox embeddings with chroma features, trained on diverse datasets, significantly enhances cross-dataset generalization in Music Emotion Recognition.

Abstract: Music Emotion Recognition (MER) is a task deeply connected to human perception, relying heavily on subjective annotations collected from contributors. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres, such as rock and classical, within a single framework. In this paper, we address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations – EmoMusic, DEAM, PMEmo, WTC, and WCMED – which span various musical styles. We demonstrate the problem of out-of-distribution generalization in a systematic experiment. By closely looking at multiple data and feature sets, we provide insight into genre-emotion relationships in existing data and examine potential genre dominance and dataset biases in certain feature representations. Based on these experiments, we arrive at a simple yet effective framework that combines embeddings extracted from the Jukebox model with chroma features and demonstrate how, alongside a combination of several diverse training sets, this permits us to train models with substantially improved cross-dataset generalization capabilities.

[649] Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba

Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis

Main category: cs.SD

TL;DR: MAVE is a novel autoregressive architecture for voice editing and TTS that combines Mamba for efficient audio sequence modeling with cross-attention for text-acoustic alignment, achieving state-of-the-art performance with significantly lower memory requirements.

DetailsMotivation: To develop a more efficient and effective voice editing and synthesis system that can perform high-fidelity text-conditioned voice editing and competitive zero-shot TTS without explicit training on TTS tasks.

Method: Integrates Mamba (structured state-space model) for efficient audio sequence modeling with cross-attention mechanisms for precise text-acoustic alignment in an autoregressive architecture.

Result: Achieves state-of-the-art in speech editing and competitive zero-shot TTS, with 57.2% of listeners rating edited speech as perceptually equal to original. Requires ~6x less memory than VoiceCraft while maintaining similar latency.

Conclusion: MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through synergistic integration of structured state-space modeling and cross-modal attention.

Abstract: We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis, built on a cross-attentive Mamba backbone. MAVE achieves state-of-the-art performance in speech editing and very competitive results in zero-shot TTS, while not being explicitly trained on the latter task, outperforming leading autoregressive and diffusion models on diverse, real-world audio. By integrating Mamba for efficient audio sequence modeling with cross-attention for precise text-acoustic alignment, MAVE enables context-aware voice editing with exceptional naturalness and speaker consistency. In pairwise human evaluations on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2% of listeners rated MAVE - edited speech as perceptually equal to the original, while 24.8% prefered the original and 18.0% MAVE - demonstrating that in the majority of cases edits are indistinguishable from the source. MAVE compares favorably with VoiceCraft and FluentSpeech both on pairwise comparisons and standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE exceeds VoiceCraft in both speaker similarity and naturalness, without requiring multiple inference runs or post-processing. Remarkably, these quality gains come with a significantly lower memory cost and approximately the same latency: MAVE requires ~6x less memory than VoiceCraft during inference on utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch size 1). Our results demonstrate that MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the synergistic integration of structured state-space modeling and cross-modal attention.

[650] SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Jan Melechovsky, Ambuj Mehrish, Abhinaba Roy, Dorien Herremans

Main category: cs.SD

TL;DR: SonicMaster is the first unified generative model for music restoration and mastering that addresses multiple audio artifacts with text-based control, trained using flow-matching on a large dataset of simulated degradations.

DetailsMotivation: Music recordings often suffer from various audio quality issues like reverberation, distortion, clipping, tonal imbalances, and narrowed stereo image, especially in non-professional settings. Current solutions require separate specialized tools and manual adjustments.

Method: Uses a flow-matching generative training paradigm to learn audio transformations from degraded to mastered versions. Trained on SonicMaster dataset containing paired degraded and high-quality tracks simulated with 19 degradation functions across 5 enhancement groups: equalization, dynamics, reverb, amplitude, and stereo.

Result: Objective audio quality metrics show significant improvement across all artifact categories. Subjective listening tests confirm listeners prefer SonicMaster’s enhanced outputs over original degraded audio.

Conclusion: SonicMaster provides an effective unified approach for music restoration and mastering that can be controlled via natural language instructions or operate automatically, demonstrating superior performance over degraded audio inputs.

Abstract: Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks by simulating common degradation types with nineteen degradation functions belonging to five enhancements groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster’s enhanced outputs over the original degraded audio, highlighting the effectiveness of our unified approach.

[651] StereoFoley: Object-Aware Stereo Audio Generation from Video

Tornike Karchkhadze, Kuan-Lin Chen, Mojtaba Heydari, Robert Henzel, Alessandro Toso, Mehrez Souden, Joshua Atkins

Main category: cs.SD

TL;DR: StereoFoley is a video-to-audio generation framework that produces high-quality stereo sound with semantic alignment, temporal synchronization, and spatial accuracy at 48kHz, addressing limitations in existing models that lack object-aware stereo imaging.

DetailsMotivation: Current video-to-audio models are limited to mono audio or lack object-aware stereo imaging due to the absence of professionally mixed, spatially accurate datasets, creating a gap in realistic audio generation for videos.

Method: Developed a base stereo audio generation model, created a synthetic data pipeline combining video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, then fine-tuned the base model on this synthetic dataset.

Result: Achieved state-of-the-art semantic accuracy and synchronization, established clear object-audio correspondence, and validated stereo object-awareness through human listening studies showing strong correlation with perception.

Conclusion: This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.

Abstract: We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-audio correspondence. Since no established metrics exist, we introduce stereo object-awareness measures and validate it through a human listening study, showing strong correlation with perception. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.

[652] Advanced Clustering Techniques for Speech Signal Enhancement: A Review and Metanalysis of Fuzzy C-Means, K-Means, and Kernel Fuzzy C-Means Methods

Abdulhady Abas Abdullah, Aram Mahmood Ahmed, Tarik Rashid, Hadi Veisi

Main category: cs.SD

TL;DR: This review paper analyzes clustering techniques for speech enhancement, finding that Kernel Fuzzy C-Means (KFCM) outperforms traditional methods like K-Means and Fuzzy C-Means in handling non-linear and non-stationary noise conditions.

DetailsMotivation: To improve speech clarity and comprehensibility in noisy environments for applications like voice-activated assistants and automated transcription services, where speech recognition quality directly impacts user experience and accessibility.

Method: Comparative analysis of clustering techniques, with focus on Kernel Fuzzy C-Means (KFCM) method and its comparison with traditional methods like K-Means (KM) and Fuzzy C-Means (FCM).

Result: KFCM provides superior performance in handling non-linear and non-stationary noise conditions and demonstrates adaptability to various noisy environments, making it robust for speech enhancement applications.

Conclusion: Advocates for shift towards more sophisticated, adaptive clustering techniques like KFCM and suggests integrating hybrid models combining KFCM with neural networks to enhance speech recognition accuracy and build more resilient speech processing systems.

Abstract: Speech signal processing is a cornerstone of modern communication technologies, tasked with improving the clarity and comprehensibility of audio data in noisy environments. The primary challenge in this field is the effective separation and recognition of speech from background noise, crucial for applications ranging from voice-activated assistants to automated transcription services. The quality of speech recognition directly impacts user experience and accessibility in technology-driven communication. This review paper explores advanced clustering techniques, particularly focusing on the Kernel Fuzzy C-Means (KFCM) method, to address these challenges. Our findings indicate that KFCM, compared to traditional methods like K-Means (KM) and Fuzzy C-Means (FCM), provides superior performance in handling non-linear and non-stationary noise conditions in speech signals. The most notable outcome of this review is the adaptability of KFCM to various noisy environments, making it a robust choice for speech enhancement applications. Additionally, the paper identifies gaps in current methodologies, such as the need for more dynamic clustering algorithms that can adapt in real time to changing noise conditions without compromising speech recognition quality. Key contributions include a detailed comparative analysis of current clustering algorithms and suggestions for further integrating hybrid models that combine KFCM with neural networks to enhance speech recognition accuracy. Through this review, we advocate for a shift towards more sophisticated, adaptive clustering techniques that can significantly improve speech enhancement and pave the way for more resilient speech processing systems.

[653] Prompt-aware classifier free guidance for diffusion models

Xuanhao Zhang, Chang Li

Main category: cs.SD

TL;DR: A prompt-aware framework that predicts optimal guidance scales for diffusion models to improve generation quality across varying prompt complexities.

DetailsMotivation: Fixed guidance scales in diffusion models fail to generalize across prompts of different complexity, causing oversaturation or weak alignment.

Method: Construct synthetic dataset with multiple guidance scales, train lightweight predictor using semantic embeddings and linguistic complexity to estimate quality curves and select optimal scale via utility function.

Result: Experiments on MSCOCO 2014 and AudioCaps show consistent improvements over vanilla CFG in fidelity, alignment, and perceptual preference.

Conclusion: Prompt-aware scale selection provides effective, training-free enhancement for pretrained diffusion models.

Abstract: Diffusion models have achieved remarkable progress in image and audio generation, largely due to Classifier-Free Guidance. However, the choice of guidance scale remains underexplored: a fixed scale often fails to generalize across prompts of varying complexity, leading to oversaturation or weak alignment. We address this gap by introducing a prompt-aware framework that predicts scale-dependent quality and selects the optimal guidance at inference. Specifically, we construct a large synthetic dataset by generating samples under multiple scales and scoring them with reliable evaluation metrics. A lightweight predictor, conditioned on semantic embeddings and linguistic complexity, estimates multi-metric quality curves and determines the best scale via a utility function with regularization. Experiments on MSCOCO~2014 and AudioCaps show consistent improvements over vanilla CFG, enhancing fidelity, alignment, and perceptual preference. This work demonstrates that prompt-aware scale selection provides an effective, training-free enhancement for pretrained diffusion backbones.

[654] VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion

Joon-Seung Choi, Dong-Min Byun, Hyung-Seok Oh, Seong-Whan Lee

Main category: cs.SD

TL;DR: VibE-SVC is a controllable singing voice conversion model that explicitly extracts and manipulates vibrato using discrete wavelet transform for precise style control.

DetailsMotivation: Controlling singing style is crucial for expressive singing voice, with vibrato being a key factor for conveying emotions. However, modeling vibrato's dynamic nature remains challenging in singing voice conversion.

Method: The model uses discrete wavelet transform to decompose the F0 contour into frequency components, enabling explicit vibrato extraction and manipulation rather than implicit modeling.

Result: Experimental results show VibE-SVC effectively transforms singing styles while preserving speaker similarity, with both subjective and objective evaluations confirming high-quality conversion.

Conclusion: The proposed approach successfully enables precise vibrato control for enhanced flexibility in singing voice conversion, addressing the challenges of modeling vibrato’s dynamic nature.

Abstract: Controlling singing style is crucial for achieving an expressive and natural singing voice. Among the various style factors, vibrato plays a key role in conveying emotions and enhancing musical depth. However, modeling vibrato remains challenging due to its dynamic nature, making it difficult to control in singing voice conversion. To address this, we propose VibESVC, a controllable singing voice conversion model that explicitly extracts and manipulates vibrato using discrete wavelet transform. Unlike previous methods that model vibrato implicitly, our approach decomposes the F0 contour into frequency components, enabling precise transfer. This allows vibrato control for enhanced flexibility. Experimental results show that VibE-SVC effectively transforms singing styles while preserving speaker similarity. Both subjective and objective evaluations confirm high-quality conversion.

[655] TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography

Yuqin Dai, Wanlu Zhu, Ronghui Li, Xiu Li, Zhenyu Zhang, Jun Li, Jian Yang

Main category: cs.SD

TL;DR: TCDiff++ is a music-driven framework that generates harmonious group dance by addressing multi-dancer collisions, foot sliding, and abrupt swapping in long sequences through specialized embeddings, loss functions, and diffusion strategies.

DetailsMotivation: Existing group dance generation methods suffer from three main issues: multi-dancer collisions, single-dancer foot sliding, and abrupt swapping in long sequences, which motivated the development of a more robust solution.

Method: Uses dancer positioning embedding for temporal/identity encoding, distance-consistency loss for collision avoidance, swap mode embedding and Footwork Adaptor for foot sliding reduction, and long group diffusion sampling with Sequence Decoder for long sequences.

Result: Extensive experiments show TCDiff++ achieves state-of-the-art performance, especially in long-duration scenarios, ensuring high-quality and coherent group dance generation.

Conclusion: TCDiff++ effectively addresses key challenges in group dance generation and demonstrates superior performance in producing harmonious, collision-free dance sequences.

Abstract: Music-driven dance generation has garnered significant attention due to its wide range of industrial applications, particularly in the creation of group choreography. During the group dance generation process, however, most existing methods still face three primary issues: multi-dancer collisions, single-dancer foot sliding and abrupt swapping in the generation of long group dance. In this paper, we propose TCDiff++, a music-driven end-to-end framework designed to generate harmonious group dance. Specifically, to mitigate multi-dancer collisions, we utilize a dancer positioning embedding to encode temporal and identity information. Additionally, we incorporate a distance-consistency loss to ensure that inter-dancer distances remain within plausible ranges. To address the issue of single-dancer foot sliding, we introduce a swap mode embedding to indicate dancer swapping patterns and design a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For long group dance generation, we present a long group diffusion sampling strategy that reduces abrupt position shifts by injecting positional information into the noisy input. Furthermore, we integrate a Sequence Decoder layer to enhance the model’s ability to selectively process long sequences. Extensive experiments demonstrate that our TCDiff++ achieves state-of-the-art performance, particularly in long-duration scenarios, ensuring high-quality and coherent group dance generation.

[656] HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling

Hung-Ying Chu, Shao-Yu Wei, Guan-Wei Chen, Tzu-Wei Hung, ChengYang Tsai, Yu-Cheng Lin

Main category: cs.SD

TL;DR: HNote is a novel hexadecimal-based notation system for symbolic music generation that encodes pitch and duration within fixed 32-unit measures, making it compatible with LLM architectures while reducing ambiguity.

DetailsMotivation: Existing music formats like MIDI, ABC, and MusicXML are either too complex or structurally inconsistent for token-based learning in LLMs, limiting their effectiveness for symbolic music generation.

Method: Extended YNote to create HNote, converted 12,300 Jiangnan-style songs from YNote to HNote, and fine-tuned LLaMA-3.1(8B) using parameter-efficient LoRA.

Result: HNote achieved 82.5% syntactic correctness rate, with BLEU and ROUGE evaluations showing strong symbolic and structural similarity, producing stylistically coherent compositions.

Conclusion: HNote establishes an effective framework for integrating LLMs with cultural music modeling, providing an aligned and unambiguous notation system for symbolic music generation.

Abstract: Recent advances in large language models (LLMs) have created new opportunities for symbolic music generation. However, existing formats such as MIDI, ABC, and MusicXML are either overly complex or structurally inconsistent, limiting their suitability for token-based learning architectures. To address these challenges, we propose HNote, a novel hexadecimal-based notation system extended from YNote, which encodes both pitch and duration within a fixed 32-unit measure framework. This design ensures alignment, reduces ambiguity, and is directly compatible with LLM architectures. We converted 12,300 Jiangnan-style songs generated from traditional folk pieces from YNote into HNote, and fine-tuned LLaMA-3.1(8B) using parameter-efficient LoRA. Experimental results show that HNote achieves a syntactic correctness rate of 82.5%, and BLEU and ROUGE evaluations demonstrate strong symbolic and structural similarity, producing stylistically coherent compositions. This study establishes HNote as an effective framework for integrating LLMs with cultural music modeling.

[657] Low Resource Audio Codec Challenge Baseline Systems

Yusuf Ziya Isik, Rafał Łaganowski

Main category: cs.SD

TL;DR: The LRAC Challenge 2025 introduces baseline neural audio codecs for low-resource environments, with Track 1 focusing on transparent speech coding and Track 2 on enhancement coding with denoising/dereverberation.

DetailsMotivation: To advance neural audio coding for deployment in resource-constrained environments that must operate reliably under everyday noise and reverberation while satisfying strict computational complexity, latency, and bitrate constraints.

Method: Convolutional neural codec models with Residual Vector Quantization, trained end-to-end using a combination of adversarial and reconstruction objectives, with detailed data filtering, augmentation strategies, and optimization procedures.

Result: Official baseline systems for both tracks (transparency codecs and enhancement codecs) in the 2025 LRAC Challenge are presented.

Conclusion: The paper establishes comprehensive baseline systems to advance low-resource neural speech codec development, addressing both transparent coding and enhancement coding scenarios under real-world noise conditions.

Abstract: The Low-Resource Audio Codec (LRAC) Challenge aims to advance neural audio coding for deployment in resource-constrained environments. The first edition focuses on low-resource neural speech codecs that must operate reliably under everyday noise and reverberation, while satisfying strict constraints on computational complexity, latency, and bitrate. Track 1 targets transparency codecs, which aim to preserve the perceptual transparency of input speech under mild noise and reverberation. Track 2 addresses enhancement codecs, which combine coding and compression with denoising and dereverberation. This paper presents the official baseline systems for both tracks in the 2025 LRAC Challenge. The baselines are convolutional neural codec models with Residual Vector Quantization, trained end-to-end using a combination of adversarial and reconstruction objectives. We detail the data filtering and augmentation strategies, model architectures, optimization procedures, and checkpoint selection criteria.

[658] Latent Multi-view Learning for Robust Environmental Sound Representations

Sivan Ding, Julia Wilkins, Magdalena Fuentes, Juan Pablo Bello

Main category: cs.SD

TL;DR: A multi-view learning framework that integrates contrastive learning into generative pipelines for environmental sound representation, encoding audio latents into view-specific and view-common subspaces with contrastive and reconstruction objectives.

DetailsMotivation: Self-supervised learning approaches like contrastive and generative methods have advanced environmental sound learning, but their complementary integration in a unified framework remains underexplored.

Method: Proposes a multi-view learning framework that encodes compressed audio latents into view-specific and view-common subspaces, guided by contrastive learning for targeted information flow and reconstruction for overall information preservation.

Result: Demonstrates improved downstream performance on urban sound sensor network dataset for sound source and sensor classification compared to traditional SSL techniques, and shows potential for disentangling environmental sound attributes.

Conclusion: The proposed framework successfully integrates contrastive and generative SSL approaches, showing enhanced performance and potential for attribute disentanglement in environmental sound representation learning.

Abstract: Self-supervised learning (SSL) approaches, such as contrastive and generative methods, have advanced environmental sound representation learning using unlabeled data. However, how these approaches can complement each other within a unified framework remains relatively underexplored. In this work, we propose a multi-view learning framework that integrates contrastive principles into a generative pipeline to capture sound source and device information. Our method encodes compressed audio latents into view-specific and view-common subspaces, guided by two self-supervised objectives: contrastive learning for targeted information flow between subspaces, and reconstruction for overall information preservation. We evaluate our method on an urban sound sensor network dataset for sound source and sensor classification, demonstrating improved downstream performance over traditional SSL techniques. Additionally, we investigate the model’s potential to disentangle environmental sound attributes within the structured latent space under varied training configurations.

cs.LG

[659] PARS: Low-Latency LLM Serving via Pairwise Learning-to-Rank

Yiheng Tao, Yihe Zhang, Matthew T. Dearing, Xin Wang, Yuping Fan, Zhiling Lan

Main category: cs.LG

TL;DR: PARS is a prompt-aware LLM task scheduler that uses pairwise ranking to approximate shortest-job-first scheduling, reducing latency by predicting response lengths and minimizing head-of-line blocking in LLM inference.

DetailsMotivation: Traditional scheduling strategies like FCFS suffer from head-of-line blocking where long-running tasks delay shorter ones, leading to inefficient LLM inference with poor latency and throughput.

Method: PARS uses pairwise ranking with margin ranking loss to predict response-length-based task ordering, approximating shortest-job-first scheduling. It integrates seamlessly into vLLM serving system with minimal overhead.

Result: Extensive experiments across multiple LLMs and real-world datasets show PARS significantly improves performance, including for reasoning workloads. Cross-model evaluations demonstrate good generalization when predictors are trained on different LLMs.

Conclusion: PARS effectively addresses head-of-line blocking in LLM inference through prompt-aware scheduling, achieving substantial latency improvements while maintaining generalization across different models.

Abstract: Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.

[660] VIFO: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

Main category: cs.LG

TL;DR: VIFO is a cross-modal forecasting model that converts multivariate time series into images to leverage pre-trained large vision models for capturing cross-channel dependencies, achieving competitive performance with minimal parameter training.

DetailsMotivation: Existing channel-independent time series models ignore cross-channel dependencies, and multimodal approaches haven't fully utilized large vision models for spatiotemporal data interpretation. There's untapped potential in using different modalities to enhance time series forecasting.

Method: VIFO renders multivariate time series into images to enable pre-trained large vision models to extract complex cross-channel patterns. These visual features are aligned and fused with time series representations. The model freezes the LVM and trains only 7.45% of parameters.

Result: VIFO achieves competitive performance on multiple benchmarks, offering an efficient solution for capturing cross-variable relationships in time series forecasting.

Conclusion: The proposed cross-modal approach effectively leverages vision models to capture cross-channel dependencies in time series data while maintaining efficiency through parameter-efficient training.

Abstract: Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Concurrently, existing multimodal approaches have not fully exploited the power of large vision models (LVMs) to interpret spatiotemporal data. Additionally, there remains significant unexplored potential in leveraging the advantages of information extraction from different modalities to enhance time series forecasting performance. To address these gaps, we propose the VIFO, a cross-modal forecasting model. VIFO uniquely renders multivariate time series into image, enabling pre-trained LVM to extract complex cross-channel patterns that are invisible to channel-independent models. These visual features are then aligned and fused with representations from the time series modality. By freezing the LVM and training only 7.45% of its parameters, VIFO achieves competitive performance on multiple benchmarks, offering an efficient and effective solution for capturing cross-variable relationships in

[661] Frequency-Aware Model Parameter Explorer: A new attribution method for improving explainability

Ali Yavari, Alireza Mohamadi, Elham Beydaghi, Rainer A. Leitgeb

Main category: cs.LG

TL;DR: The paper proposes transferable frequency-aware attacks and a novel attribution method called FAMPE that improves DNN explainability by leveraging both high- and low-frequency components, achieving 13.02% average gain in Insertion Score over state-of-the-art methods.

DetailsMotivation: To address the challenge of ensuring DNN reliability against real-world noise and intentional perturbations, and to improve the suboptimal efficacy of existing attribution methods.

Method: Proposes transferable frequency-aware attacks that enable frequency-aware exploration via both high- and low-frequency components, and develops FAMPE (Frequency-Aware Model Parameter Explorer) attribution method.

Result: FAMPE achieves an average gain of 13.02% in Insertion Score compared to the state-of-the-art method AttEXplore, outperforming existing approaches.

Conclusion: The proposed frequency-aware approach effectively improves DNN explainability, with detailed ablation studies confirming the importance of both high- and low-frequency components in attribution methods.

Abstract: Ensuring the reliability of deep neural networks (DNNs) in the presence of real world noise and intentional perturbations remains a significant challenge. To address this, attribution methods have been proposed, though their efficacy remains suboptimal and necessitates further refinement. In this paper, we propose a novel category of transferable adversarial attacks, called transferable frequency-aware attacks, enabling frequency-aware exploration via both high-and low-frequency components. Based on this type of attacks, we also propose a novel attribution method, named Frequency-Aware Model Parameter Explorer (FAMPE), which improves the explainability for DNNs. Relative to the current state-of-the-art method AttEXplore, our FAMPE attains an average gain of 13.02% in Insertion Score, thereby outperforming existing approaches. Through detailed ablation studies, we also investigate the role of both high- and low-frequency components in explainability.

[662] StructPrune: Structured Global Pruning asymptotics with $\mathcal{O}(\sqrt{N})$ GPU Memory

Xinyuan Song, Guangji Bai, Liang Zhao

Main category: cs.LG

TL;DR: STRUPRUNE is a memory-efficient structured pruning framework that combines local pruning’s memory benefits with structured pruning’s hardware efficiency using a divide-and-conquer ADMM approach.

DetailsMotivation: Global pruning requires O(N) memory (infeasible for billion-parameter models), while local pruning neglects inter-layer dependencies and performs poorly at high sparsity. Structured pruning is hardware-efficient but typically relies on global methods.

Method: Divide-and-conquer strategy decomposing global pruning into coordinated subproblems across modules. ADMM-based framework with closed-form analytical solution for structured pruning masks and energy-based asymptotic framework for layer-wise sparsity allocation.

Result: Matches perplexity of global structured pruning while reducing memory cost from O(N) to O(√N), enabling practical deployment at billion-parameter scale.

Conclusion: STRUPRUNE successfully achieves both structured pruning benefits and local pruning memory efficiency, making large-scale model pruning practical.

Abstract: Pruning is critical for scaling large language models (LLMs). Global pruning achieves strong performance but requires $\mathcal{O}(N)$ memory, which is infeasible for billion-parameter models. Local pruning reduces GPU memory usage to that of a single layer by pruning layers independently, but it neglects inter-layer dependencies and often leads to suboptimal performance in high-sparsity regimes. Unlike unstructured pruning, structured pruning produces regular sparsity patterns that align well with GPU kernels and library optimizations, making it more hardware-efficient. However, structured pruning typically relies on global pruning, since structured patterns are more prone to severe performance degradation under local optimization. To jointly achieve structured pruning and the memory efficiency of local pruning, we propose a divide-and-conquer strategy that decomposes the global pruning problem into coordinated subproblems across different modules, each of which fits within limited GPU memory. Building on this idea, we design \textbf{STRUPRUNE}, an ADMM-based framework that integrates structured sparsity into the pruning process, combining the memory efficiency of local pruning with the hardware compatibility of structured methods. We derive a closed-form analytical solution for structured pruning masks that provides an explicit rule for layer-wise sparsity allocation, and further develop an energy-based asymptotic framework yielding a softmax-form allocation scheme that simplifies optimization while adapting to heterogeneous layer importance. Experiments demonstrate that STRUPRUNE matches the perplexity of global structured pruning while reducing memory cost from $\mathcal{O}(N)$ to $\mathcal{O}(\sqrt{N})$, enabling practical deployment at the billion-parameter scale.

[663] Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

Jiancheng Zhang, Yinglun Zhu

Main category: cs.LG

TL;DR: First framework for multimodal active learning with unaligned data that reduces annotation costs by actively acquiring cross-modal alignments instead of labels on pre-aligned pairs.

DetailsMotivation: Existing AL algorithms focus only on unimodal data, overlooking the substantial annotation burden in multimodal learning where high-quality alignment between modalities is costly.

Method: Developed a new algorithm combining uncertainty and diversity principles in modality-aware design with linear-time acquisition, applicable to both pool-based and streaming-based settings.

Result: Extensive experiments show consistent reduction in multimodal annotation cost while preserving performance; on ColorSwap dataset, cuts annotation requirements by up to 40% without accuracy loss.

Conclusion: The framework effectively addresses the practical bottleneck in modern multimodal pipelines by reducing alignment annotation costs while maintaining model performance.

Abstract: Active learning (AL) is a principled strategy to reduce annotation cost in data-hungry deep learning. However, existing AL algorithms focus almost exclusively on unimodal data, overlooking the substantial annotation burden in multimodal learning. We introduce the first framework for multimodal active learning with unaligned data, where the learner must actively acquire cross-modal alignments rather than labels on pre-aligned pairs. This setting captures the practical bottleneck in modern multimodal pipelines such as CLIP and SigLIP, where unimodal features are easy to obtain but high-quality alignment is costly. We develop a new algorithm that combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. Extensive experiments on benchmark datasets demonstrate that our approach consistently reduces multimodal annotation cost while preserving performance; for instance, on the ColorSwap dataset it cuts annotation requirements by up to $40%$ without loss in accuracy.

[664] Real-Time Brain Biomechanics Prediction with Neural Operators: Toward Clinically Deployable Traumatic Brain Injury Models

Anusha Agarwal, Dibakar Roy Sarkar, Somdatta Goswami

Main category: cs.LG

TL;DR: Neural operators enable real-time brain deformation prediction for traumatic brain injury assessment, reducing computation time from hours to milliseconds while maintaining high accuracy.

DetailsMotivation: Finite element models for traumatic brain injury are computationally expensive (hours per simulation), limiting clinical utility for rapid decision-making. There's a need for faster, patient-specific prediction methods.

Method: Used four neural operator architectures (FNO, F-FNO, MG-FNO, DeepONet) to map anatomical MRI, MRE stiffness maps, and demographic features to 3D brain displacement fields. Trained on 249 MRE datasets across 20-90 Hz frequencies.

Result: MG-FNO achieved highest accuracy (MSE = 0.0023, 94.3% spatial fidelity), F-FNO converged 2× faster than standard FNO, and DeepONet offered fastest inference (14.5 iterations/s) with 7× speed-up over MG-FNO. All NOs reduced computation from hours to milliseconds.

Conclusion: Neural operators provide efficient, resolution-invariant brain deformation prediction, enabling real-time TBI risk assessment, clinical triage support, and optimization of protective equipment. They show potential for scalable digital twins of the human brain.

Abstract: Traumatic brain injury (TBI) remains a major public health concern, with over 69 million cases annually worldwide. Finite element (FE) models offer high-fidelity predictions of brain deformation but are computationally expensive, requiring hours per simulation and limiting their clinical utility for rapid decision-making. This study benchmarks state-of-the-art neural operator (NO) architectures for rapid, patient-specific prediction of brain displacement fields, aiming to enable real-time TBI modeling in clinical and translational settings. We formulated TBI modeling as an operator learning problem, mapping subject-specific anatomical MRI, magnetic resonance elastography (MRE) stiffness maps, and demographic features to full-field 3D brain displacement predictions. Four architectures - Fourier Neural Operator (FNO), Factorized FNO (F-FNO), Multi-Grid FNO (MG-FNO), and Deep Operator Network (DeepONet) were trained and evaluated on 249 MRE datasets across physiologically relevant frequencies (20 - 90 Hz). MG-FNO achieved the highest accuracy (MSE = 0.0023, 94.3% spatial fidelity) and preserved fine-scale features, while F-FNO converged 2$\times$ faster than standard FNO. DeepONet offered the fastest inference (14.5 iterations/s) with a 7$\times$ computational speed-up over MG-FNO, suggesting utility for embedded or edge computing applications. All NOs reduced computation time from hours to milliseconds without sacrificing anatomical realism. NOs provide an efficient, resolution-invariant approach for predicting brain deformation, opening the door to real-time, patient-specific TBI risk assessment, clinical triage support, and optimization of protective equipment. These results highlight the potential for NO-based digital twins of the human brain, enabling scalable, on-demand biomechanical modeling in both clinical and population health contexts.

[665] Light Differentiable Logic Gate Networks

Lukas Rüttgers, Till Aczel, Andreas Plesner, Roger Wattenhofer

Main category: cs.LG

TL;DR: The paper proposes a reparametrization of logic gate neurons in Differentiable Logic Gate Networks (DLGNs) to address vanishing gradients, discretization errors, and high training costs, achieving faster convergence, smaller model size, and maintained accuracy.

DetailsMotivation: DLGNs are efficient at inference but suffer from vanishing gradients, discretization errors, and high training costs that impede scaling, even with improved initialization schemes.

Method: The authors propose a reparametrization of logic gate neurons that shrinks parameter size logarithmically in the number of inputs per gate.

Result: For binary inputs, the reparametrization reduces model size by 4x, speeds up backward pass by up to 1.86x, converges in 8.5x fewer training steps, and maintains or improves accuracy on CIFAR-100 compared to original parametrization.

Conclusion: The root cause of DLGN scaling issues lies in the parametrization of logic gate neurons, and the proposed reparametrization effectively addresses these limitations while improving efficiency.

Abstract: Differentiable logic gate networks (DLGNs) exhibit extraordinary efficiency at inference while sustaining competitive accuracy. But vanishing gradients, discretization errors, and high training cost impede scaling these networks. Even with dedicated parameter initialization schemes from subsequent works, increasing depth still harms accuracy. We show that the root cause of these issues lies in the underlying parametrization of logic gate neurons themselves. To overcome this issue, we propose a reparametrization that also shrinks the parameter size logarithmically in the number of inputs per gate. For binary inputs, this already reduces the model size by 4x, speeds up the backward pass by up to 1.86x, and converges in 8.5x fewer training steps. On top of that, we show that the accuracy on CIFAR-100 remains stable and sometimes superior to the original parametrization.

[666] Triple-BERT: Do We Really Need MARL for Order Dispatch on Ride-Sharing Platforms?

Zijian Zhao, Sen Li

Main category: cs.LG

TL;DR: Triple-BERT is a centralized single agent reinforcement learning method that improves order dispatching for ride-sharing platforms by handling large action and observation spaces through action decomposition and BERT-based networks.

DetailsMotivation: Existing MARL methods for ride-sharing order dispatching either fail to capture global information (independent MARL) or suffer from dimensionality issues (CTDE MARL), necessitating a better approach for large-scale systems.

Method: Built on TD3 variant, uses action decomposition to break joint action probability into individual driver actions, and employs BERT-based network with parameter reuse and attention mechanisms to handle large observation spaces and capture complex driver-order relationships.

Result: Achieved 11.95% improvement over state-of-the-art methods, with 4.26% increase in served orders and 22.25% reduction in pickup times on real-world Manhattan ride-hailing dataset.

Conclusion: Triple-BERT effectively addresses the challenges of large-scale order dispatching in ride-sharing platforms and demonstrates significant performance improvements over existing methods.

Abstract: On-demand ride-sharing platforms, such as Uber and Lyft, face the intricate real-time challenge of bundling and matching passengers-each with distinct origins and destinations-to available vehicles, all while navigating significant system uncertainties. Due to the extensive observation space arising from the large number of drivers and orders, order dispatching, though fundamentally a centralized task, is often addressed using Multi-Agent Reinforcement Learning (MARL). However, independent MARL methods fail to capture global information and exhibit poor cooperation among workers, while Centralized Training Decentralized Execution (CTDE) MARL methods suffer from the curse of dimensionality. To overcome these challenges, we propose Triple-BERT, a centralized Single Agent Reinforcement Learning (MARL) method designed specifically for large-scale order dispatching on ride-sharing platforms. Built on a variant TD3, our approach addresses the vast action space through an action decomposition strategy that breaks down the joint action probability into individual driver action probabilities. To handle the extensive observation space, we introduce a novel BERT-based network, where parameter reuse mitigates parameter growth as the number of drivers and orders increases, and the attention mechanism effectively captures the complex relationships among the large pool of driver and orders. We validate our method using a real-world ride-hailing dataset from Manhattan. Triple-BERT achieves approximately an 11.95% improvement over current state-of-the-art methods, with a 4.26% increase in served orders and a 22.25% reduction in pickup times. Our code, trained model parameters, and processed data are publicly available at the repository https://github.com/RS2002/Triple-BERT .

[667] Numerion: A Multi-Hypercomplex Model for Time Series Forecasting

Hanzhong Cao, Wenbo Yan, Ying Tan

Main category: cs.LG

TL;DR: Numerion is a time series forecasting model that uses multiple hypercomplex spaces to naturally decompose time series, leveraging the discovery that characteristic frequencies decrease in higher-order hypercomplex spaces.

DetailsMotivation: Existing time series forecasting methods rely on complex decomposition techniques with computational limitations and robustness issues. The research discovered that time series frequencies naturally decrease in hypercomplex spaces, providing a more natural decomposition approach.

Method: Proposes Numerion model with Real-Hypercomplex-Real Domain Multi-Layer Perceptrons (RHR-MLPs) that generalize linear layers and activation functions to hypercomplex spaces of power-of-two dimensions. Uses multiple RHR-MLPs to map time series into different hypercomplex spaces and fuses patterns through dynamic fusion.

Result: Achieves state-of-the-art performance on multiple public datasets. Visualizations and analyses demonstrate the model’s ability to naturally decompose time series and show that higher dimensional hypercomplex spaces capture lower frequency features.

Conclusion: Numerion provides an effective approach for time series forecasting by leveraging hypercomplex spaces for natural decomposition, overcoming limitations of traditional decomposition methods while achieving superior performance.

Abstract: Many methods aim to enhance time series forecasting by decomposing the series through intricate model structures and prior knowledge, yet they are inevitably limited by computational complexity and the robustness of the assumptions. Our research uncovers that in the complex domain and higher-order hypercomplex spaces, the characteristic frequencies of time series naturally decrease. Leveraging this insight, we propose Numerion, a time series forecasting model based on multiple hypercomplex spaces. Specifically, grounded in theoretical support, we generalize linear layers and activation functions to hypercomplex spaces of arbitrary power-of-two dimensions and introduce a novel Real-Hypercomplex-Real Domain Multi-Layer Perceptron (RHR-MLP) architecture. Numerion utilizes multiple RHR-MLPs to map time series into hypercomplex spaces of varying dimensions, naturally decomposing and independently modeling the series, and adaptively fuses the latent patterns exhibited in different spaces through a dynamic fusion mechanism. Experiments validate the model`s performance, achieving state-of-the-art results on multiple public datasets. Visualizations and quantitative analyses comprehensively demonstrate the ability of multi-dimensional RHR-MLPs to naturally decompose time series and reveal the tendency of higher dimensional hypercomplex spaces to capture lower frequency features.

[668] KVComm: Enabling Efficient LLM Communication through Selective KV Sharing

Xiangyu Shi, Marco Chiesa, Gerald Q. Maguire Jr., Dejan Kostic

Main category: cs.LG

TL;DR: KVComm enables efficient LLM communication by selectively sharing KV pairs using attention importance scores with Gaussian prior, achieving comparable performance to upper-bound methods while transmitting only 30% of layers’ KV pairs.

DetailsMotivation: Existing LLM communication protocols either use natural language (high cost, information loss) or hidden states (bias, inefficiency), requiring a more efficient communication medium for multi-agent systems.

Method: Proposed KVComm framework that selectively shares KV pairs between LLMs using layer-wise selection strategy based on attention importance scores with Gaussian prior to identify most informative KV pairs.

Result: KVComm achieves comparable performance to upper-bound method (direct input merging) while transmitting only 30% of layers’ KV pairs across diverse tasks and model pairs.

Conclusion: KV pairs serve as an effective medium for inter-LLM communication, enabling scalable and efficient multi-agent systems with minimal communication overhead.

Abstract: Large Language Models (LLMs) are increasingly deployed in multi-agent systems, where effective inter-model communication is crucial. Existing communication protocols either rely on natural language, incurring high inference costs and information loss, or on hidden states, which suffer from information concentration bias and inefficiency. To address these limitations, we propose KVComm, a novel communication framework that enables efficient communication between LLMs through selective sharing of KV pairs. KVComm leverages the rich information encoded in the KV pairs while avoiding the pitfalls of hidden states. We introduce a KV layer-wise selection strategy based on attention importance scores with a Gaussian prior to identify the most informative KV pairs for communication. Extensive experiments across diverse tasks and model pairs demonstrate that KVComm achieves comparable performance to the upper-bound method, which directly merges inputs to one model without any communication, while transmitting as few as 30% of layers’ KV pairs. Our study highlights the potential of KV pairs as an effective medium for inter-LLM communication, paving the way for scalable and efficient multi-agent systems.

[669] Universal Multi-Domain Translation via Diffusion Routers

Duc Kieu, Kien Do, Tuan Hoang, Thao Minh Le, Tung Kieu, Dang Nguyen, Thin Nguyen

Main category: cs.LG

TL;DR: Universal Multi-Domain Translation (UMDT) enables translation between any pair of K domains using only K-1 paired datasets with a central domain, overcoming limitations of existing approaches that require full alignment or can only handle seen domain pairs.

DetailsMotivation: Existing multi-domain translation approaches are limited by requiring fully aligned tuples or only handling domain pairs seen during training, which restricts practicality and excludes many cross-domain mappings.

Method: Proposed Diffusion Router (DR) - a unified diffusion-based framework that models all central↔non-central translations with a single noise predictor conditioned on source and target domain labels, enabling indirect translations through the central domain and supporting direct non-central mappings via variational-bound objective and Tweedie refinement.

Result: DR achieves state-of-the-art results on three large-scale UMDT benchmarks for both indirect and direct translations, while lowering sampling cost and enabling novel tasks like sketch↔segmentation.

Conclusion: DR establishes a scalable and versatile framework for universal translation across multiple domains, overcoming limitations of previous approaches.

Abstract: Multi-domain translation (MDT) aims to learn translations between multiple domains, yet existing approaches either require fully aligned tuples or can only handle domain pairs seen in training, limiting their practicality and excluding many cross-domain mappings. We introduce universal MDT (UMDT), a generalization of MDT that seeks to translate between any pair of $K$ domains using only $K-1$ paired datasets with a central domain. To tackle this problem, we propose Diffusion Router (DR), a unified diffusion-based framework that models all central$\leftrightarrow$non-central translations with a single noise predictor conditioned on the source and target domain labels. DR enables indirect non-central translations by routing through the central domain. We further introduce a novel scalable learning strategy with a variational-bound objective and an efficient Tweedie refinement procedure to support direct non-central mappings. Through evaluation on three large-scale UMDT benchmarks, DR achieves state-of-the-art results for both indirect and direct translations, while lowering sampling cost and unlocking novel tasks such as sketch$\leftrightarrow$segmentation. These results establish DR as a scalable and versatile framework for universal translation across multiple domains.

[670] The Argument is the Explanation: Structured Argumentation for Trust in Agents

Ege Cakar, Per Ola Kristensson

Main category: cs.LG

TL;DR: The paper proposes using structured argumentation to provide verifiable reasoning chains for AI explainability, achieving state-of-the-art performance on argument relation classification and enabling transparent multi-agent risk assessment with automatic hallucination detection.

DetailsMotivation: Current AI explainability approaches focus on mechanistic transparency, but stakeholders actually need verifiable arguments similar to how human society functions. Neither interpretability nor LLM-generated explanations can provide the level of verification required.

Method: A pipeline that converts LLM text into argument graphs using Bipolar Assumption-Based Argumentation to capture support/attack relationships. This enables verification at each inferential step and includes a mechanism for iterative refinement through test-time feedback without retraining.

Result: Achieved state-of-the-art 94.44 macro F1 on AAEC dataset (5.7 points above prior work) and 0.81 macro F1 for Argumentative MicroTexts relation classification (~0.07 above previous results). Successfully demonstrated on multi-agent risk assessment using Structured What-If Technique.

Conclusion: Structured argumentation provides a superior approach to AI explainability by offering verifiable reasoning chains that enable transparent collaboration between specialized agents and automatic detection of hallucinations through fact-based argument attacks.

Abstract: Humans are black boxes – we cannot observe their neural processes, yet society functions by evaluating verifiable arguments. AI explainability should follow this principle: stakeholders need verifiable reasoning chains, not mechanistic transparency. We propose using structured argumentation to provide a level of explanation and verification neither interpretability nor LLM-generated explanation is able to offer. Our pipeline achieves state-of-the-art 94.44 macro F1 on the AAEC published train/test split (5.7 points above prior work) and $0.81$ macro F1, $\sim$0.07 above previous published results with comparable data setups, for Argumentative MicroTexts relation classification, converting LLM text into argument graphs and enabling verification at each inferential step. We demonstrate this idea on multi-agent risk assessment using the Structured What-If Technique, where specialized agents collaborate transparently to carry out risk assessment otherwise achieved by humans alone. Using Bipolar Assumption-Based Argumentation, we capture support/attack relationships, thereby enabling automatic hallucination detection via fact nodes attacking arguments. We also provide a verification mechanism that enables iterative refinement through test-time feedback without retraining. For easy deployment, we provide a Docker container for the fine-tuned AMT model, and the rest of the code with the Bipolar ABA Python package on GitHub.

[671] Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Heyang Gao, Zexu Sun, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen

Main category: cs.LG

TL;DR: HPL introduces hierarchical preference learning with multi-granularity DPO and dual-layer curriculum to optimize LLM agents, outperforming SOTA methods on agent benchmarks.

DetailsMotivation: Existing preference-based offline methods like DPO face granularity mismatch - trajectory-level is too coarse for credit assignment while step-level is too myopic for multi-step behaviors.

Method: HPL framework with hierarchical DPO (trajectory, step, and group levels) plus dual-layer curriculum scheduler organizing learning from simple to complex along group length and sample difficulty axes.

Result: HPL outperforms state-of-the-art methods on three challenging agent benchmarks, effectively integrating preference signals across granularities.

Conclusion: The hierarchical DPO loss with dual-layer curriculum enables agents to solve tasks ranging from simple behaviors to complex multi-step sequences.

Abstract: Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Aligning these agents via preference-based offline methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory-level DPO provides a signal that is too coarse for precise credit assignment, while step-level DPO is often too myopic to capture the value of multi-step behaviors. To resolve this challenge, we introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level. Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex. This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups. Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods. Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.

[672] Adversarial training with restricted data manipulation

David Benfield, Stefano Coniglio, Phan Tu Vuong, Alain Zemkoho

Main category: cs.LG

TL;DR: The paper proposes a constrained pessimistic bilevel optimization model to train more realistic and effective classifiers against adversarial attacks, addressing limitations of existing unrestricted adversary approaches.

DetailsMotivation: Existing pessimistic bilevel optimization methods for adversarial machine learning allow unrestricted adversaries, which can lead to overly pessimistic models that perform poorly on real-world data when adversaries generate nonsensical data.

Method: The authors develop a constrained pessimistic bilevel optimization model that restricts the adversary’s movements to ensure the generated adversarial data maintains its intended nature and better reflects reality.

Result: Experimental results show that the proposed constrained model performs better on average than the existing unrestricted adversary approach.

Conclusion: Constraining the adversary in pessimistic bilevel optimization leads to more realistic and effective classifiers for adversarial machine learning scenarios.

Abstract: Adversarial machine learning concerns situations in which learners face attacks from active adversaries. Such scenarios arise in applications such as spam email filtering, malware detection and fake image generation, where security methods must be actively updated to keep up with the everimproving generation of malicious data. Pessimistic Bilevel optimisation has been shown to be an effective method of training resilient classifiers against such adversaries. By modelling these scenarios as a game between the learner and the adversary, we anticipate how the adversary will modify their data and then train a resilient classifier accordingly. However, since existing pessimistic bilevel approaches feature an unrestricted adversary, the model is vulnerable to becoming overly pessimistic and unrealistic. When finding the optimal solution that defeats the classifier, it is possible that the adversary’s data becomes nonsensical and loses its intended nature. Such an adversary will not properly reflect reality, and consequently, will lead to poor classifier performance when implemented on real-world data. By constructing a constrained pessimistic bilevel optimisation model, we restrict the adversary’s movements and identify a solution that better reflects reality. We demonstrate through experiments that this model performs, on average, better than the existing approach.

[673] SciTS: Scientific Time Series Understanding and Generation with LLMs

Wen Wu, Ziyang Zhang, Liwei Liu, Xuenan Xu, Junlin Liu, Ke Fan, Qitan Lv, Jimin Zhuang, Chen Zhang, Zheqi Yuan, Siyuan Hou, Tianyi Lin, Kai Chen, Bowen Zhou, Chao Zhang

Main category: cs.LG

TL;DR: SciTS benchmark for scientific time series analysis reveals LLMs outperform specialized models, leading to TimeOmni framework for better time series understanding and generation.

DetailsMotivation: Current multimodal LLMs inadequately handle scientific time series by converting them to text or images, losing precision and struggling with long sequences. No dedicated benchmarks exist for scientific time series across domains.

Method: Created SciTS benchmark with 12 domains, 43 tasks, 50k+ instances including univariate/multivariate signals. Benchmarked 17 models including LLMs and specialized time series models. Developed TimeOmni framework for LLM time series capabilities.

Result: General-purpose LLMs show stronger generalizability than specialized time series models. Text/image representations limit performance due to sequence length and precision loss. TimeOmni enables effective time series understanding and generation.

Conclusion: LLMs have strong potential for scientific time series tasks when properly equipped. TimeOmni framework bridges the gap in scientific time series modeling, enabling LLMs to handle complex temporal scientific data.

Abstract: The scientific reasoning ability of large language models (LLMs) has recently attracted significant attention. Time series, as a fundamental modality in scientific data, presents unique challenges that are often overlooked in current multimodal LLMs, which either encode numerical sequences as text or convert them into images. Such approaches may be insufficient for comprehensive scientific time series understanding and generation. Existing unified time series models typically specialise in either forecasting or analysis, and their effectiveness on non-periodic, heterogeneous scientific signals remains unclear. To address these gaps, we introduce SciTS, a benchmark spanning 12 scientific domains and 43 tasks, with over 50k+ instances, both univariate and multivariate signals ranging from $10^0$ to $10^7$ in length and up to 10~MHz in frequency. We benchmark 17 models, including text-only LLMs, multimodal LLMs, and unified time series models, and find that general-purpose LLMs exhibit stronger generalisability than specialised time series models, while representing time series as text or images limits their performance due to excessively long sequences and loss of numerical precision, respectively. We then introduce TimeOmni, a framework that equips LLMs with the ability to understand and generate time series while remaining compatible with general-purpose LLM training. This work fills a gap in both dedicated benchmarks and modelling frameworks for scientific time series, paving the way for LLMs to understand and generate complex temporal scientific data.

[674] Matching the Optimal Denoiser in Point Cloud Diffusion with (Improved) Rotational Alignment

Ameya Daigavane, YuQing Xie, Bodhi P. Vani, Saeed Saremi, Joseph Kleinhenz, Tess Smidt

Main category: cs.LG

TL;DR: The paper analyzes the effect of rotational alignment in diffusion models for point clouds, showing that alignment corresponds to sampling the mode of a matrix Fisher distribution and is effective as a zeroth-order approximation for small noise levels.

DetailsMotivation: To understand why rotational alignment works well in diffusion models for symmetric point clouds like molecules and proteins, where there's no canonical orientation.

Method: The authors express the optimal denoiser in terms of a matrix Fisher distribution over SO(3) and derive better approximators to the optimal denoiser for small noise levels.

Result: Alignment is shown to be the zeroth-order approximation for small noise levels, explaining its effectiveness, and the experiments confirm that alignment is often sufficient for the noise levels most relevant in training diffusion models.

Conclusion: Rotational alignment in diffusion models for point clouds is mathematically justified as sampling the mode of a matrix Fisher distribution and serves as an effective approximation for practical training scenarios.

Abstract: Diffusion models are a popular class of generative models trained to reverse a noising process starting from a target data distribution. Training a diffusion model consists of learning how to denoise noisy samples at different noise levels. When training diffusion models for point clouds such as molecules and proteins, there is often no canonical orientation that can be assigned. To capture this symmetry, the true data samples are often augmented by transforming them with random rotations sampled uniformly over $SO(3)$. Then, the denoised predictions are often rotationally aligned via the Kabsch-Umeyama algorithm to the ground truth samples before computing the loss. However, the effect of this alignment step has not been well studied. Here, we show that the optimal denoiser can be expressed in terms of a matrix Fisher distribution over $SO(3)$. Alignment corresponds to sampling the mode of this distribution, and turns out to be the zeroth order approximation for small noise levels, explaining its effectiveness. We build on this perspective to derive better approximators to the optimal denoiser in the limit of small noise. Our experiments highlight that alignment is often a `good enough’ approximation for the noise levels that matter most for training diffusion models.

[675] POEM: Explore Unexplored Reliable Samples to Enhance Test-Time Adaptation

Chang’an Yi, Xiaohui Deng, Shuaicheng Niu, Yan Zhou

Main category: cs.LG

TL;DR: POEM is a test-time adaptation method that explores previously overlooked reliable samples and uses an Adapt Branch to balance domain-agnostic representations with target performance, outperforming existing TTA methods.

DetailsMotivation: Existing TTA methods rely on entropy thresholds that cause reliable samples to be overlooked, as samples that slightly exceed thresholds initially may become reliable after model updates.

Method: Proposes POEM approach to explore previously unexplored reliable samples and introduces an Adapt Branch network to balance domain-agnostic representations with target performance.

Result: Comprehensive experiments show POEM consistently outperforms existing TTA methods across multiple architectures in challenging scenarios and real-world domain shifts while remaining computationally efficient.

Conclusion: POEM effectively addresses the limitation of entropy-based TTA methods and can be used as an augmentation strategy to boost existing TTA approaches.

Abstract: Test-time adaptation (TTA) aims to transfer knowledge from a source model to unknown test data with potential distribution shifts in an online manner. Many existing TTA methods rely on entropy as a confidence metric to optimize the model. However, these approaches are sensitive to the predefined entropy threshold, influencing which samples are chosen for model adaptation. Consequently, potentially reliable target samples are often overlooked and underutilized. For instance, a sample’s entropy might slightly exceed the threshold initially, but fall below it after the model is updated. Such samples can provide stable supervised information and offer a normal range of gradients to guide model adaptation. In this paper, we propose a general approach, \underline{POEM}, to promote TTA via ex\underline{\textbf{p}}loring the previously unexpl\underline{\textbf{o}}red reliabl\underline{\textbf{e}} sa\underline{\textbf{m}}ples. Additionally, we introduce an extra Adapt Branch network to strike a balance between extracting domain-agnostic representations and achieving high performance on target data. Comprehensive experiments across multiple architectures demonstrate that POEM consistently outperforms existing TTA methods in both challenging scenarios and real-world domain shifts, while remaining computationally efficient. The effectiveness of POEM is evaluated through extensive analyses and thorough ablation studies. Moreover, the core idea behind POEM can be employed as an augmentation strategy to boost the performance of existing TTA approaches. The source code is publicly available at \emph{https://github.com/ycarobot/POEM}

[676] Interpretable Neuropsychiatric Diagnosis via Concept-Guided Graph Neural Networks

Song Wang, Zhenyu Lei, Zhen Tan, Jundong Li, Javier Rasero, Aiying Zhang, Chirag Agarwal

Main category: cs.LG

TL;DR: CONCEPTNEURO is an interpretable concept-based framework that uses LLMs and neurobiological knowledge to generate functional connectivity concepts for psychiatric disorder diagnosis, outperforming traditional GNNs while providing transparent explanations.

DetailsMotivation: High prevalence of adolescent mental health conditions (1 in 5) and limitations of black-box GNN approaches in clinical translation, requiring interpretable and reliable diagnostic tools.

Method: Leverages LLMs and domain knowledge to automatically generate, filter, and encode interpretable functional connectivity concepts as structured subgraphs linking brain regions, then uses concept classifiers for diagnosis.

Result: CONCEPTNEURO-augmented GNNs consistently outperform vanilla GNNs across multiple psychiatric disorder datasets, improving accuracy while providing clinically aligned explanations and revealing disorder-specific connectivity patterns.

Conclusion: Establishes CONCEPTNEURO as an interpretable, domain-informed framework that enables both strong predictive performance and transparent explanations for psychiatric disorder diagnosis, with potential for generating new research hypotheses.

Abstract: Nearly one in five adolescents currently live with a diagnosed mental or behavioral health condition, such as anxiety, depression, or conduct disorder, underscoring the urgency of developing accurate and interpretable diagnostic tools. Resting-state functional magnetic resonance imaging (rs-fMRI) provides a powerful lens into large-scale functional connectivity, where brain regions are modeled as nodes and inter-regional synchrony as edges, offering clinically relevant biomarkers for psychiatric disorders. While prior works use graph neural network (GNN) approaches for disorder prediction, they remain complex black-boxes, limiting their reliability and clinical translation. In this work, we propose CONCEPTNEURO, a concept-based diagnosis framework that leverages large language models (LLMs) and neurobiological domain knowledge to automatically generate, filter, and encode interpretable functional connectivity concepts. Each concept is represented as a structured subgraph linking specific brain regions, which are then passed through a concept classifier. Our design ensures predictions through clinically meaningful connectivity patterns, enabling both interpretability and strong predictive performance. Extensive experiments across multiple psychiatric disorder datasets demonstrate that CONCEPTNEURO-augmented GNNs consistently outperform their vanilla counterparts, improving accuracy while providing transparent, clinically aligned explanations. Furthermore, concept analyses highlight disorder-specific connectivity patterns that align with expert knowledge and suggest new hypotheses for future investigation, establishing CONCEPTNEURO as an interpretable, domain-informed framework for psychiatric disorder diagnosis.

[677] Deep Reinforcement Learning for Multi-Agent Coordination

Kehinde O. Aina, Sehoon Ha

Main category: cs.LG

TL;DR: Proposes S-MADRL framework using virtual pheromones for decentralized multi-robot coordination in crowded environments, achieving emergent self-organization without explicit communication.

DetailsMotivation: Address coordination challenges in narrow, confined environments where congestion and interference hinder collective task performance, inspired by insect colonies' stigmergic coordination.

Method: Stigmergic Multi-Agent Deep Reinforcement Learning (S-MADRL) with virtual pheromones to model local/social interactions, combined with curriculum learning to decompose complex tasks into progressively harder sub-problems.

Result: Achieves effective coordination of up to eight agents with emergent asymmetric workload distributions that reduce congestion and modulate group performance, outperforming MADQN, MADDPG, and MAPPO.

Conclusion: Demonstrates scalable decentralized multi-agent coordination solution for crowded environments with communication constraints, analogous to natural strategies observed in insect colonies.

Abstract: We address the challenge of coordinating multiple robots in narrow and confined environments, where congestion and interference often hinder collective task performance. Drawing inspiration from insect colonies, which achieve robust coordination through stigmergy – modifying and interpreting environmental traces – we propose a Stigmergic Multi-Agent Deep Reinforcement Learning (S-MADRL) framework that leverages virtual pheromones to model local and social interactions, enabling decentralized emergent coordination without explicit communication. To overcome the convergence and scalability limitations of existing algorithms such as MADQN, MADDPG, and MAPPO, we leverage curriculum learning, which decomposes complex tasks into progressively harder sub-problems. Simulation results show that our framework achieves the most effective coordination of up to eight agents, where robots self-organize into asymmetric workload distributions that reduce congestion and modulate group performance. This emergent behavior, analogous to strategies observed in nature, demonstrates a scalable solution for decentralized multi-agent coordination in crowded environments with communication constraints.

[678] Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

Yoonjeon Kim, Doohyuk Jang, Eunho Yang

Main category: cs.LG

TL;DR: The paper proposes MASA (Meta-Awareness via Self-Alignment), a training method that enhances language models’ meta-awareness by aligning meta-predictions with true reasoning rollouts, leading to significant performance gains and training efficiency improvements.

DetailsMotivation: Current large reasoning models lack meta-awareness - the ability to know how to think by themselves, as evidenced by severe misalignment between true rollouts and predicted meta information. The authors hypothesize that aligning meta-prediction with true rollouts will improve performance.

Method: The MASA training pipeline leverages self-generated signals to train meta-awareness without external training sources. It improves efficiency by filtering out zero-variance prompts (trivial/unsolvable) and cutting off lengthy rollouts unlikely to lead to correct answers.

Result: Significant improvements in accuracy and training efficiency: 1.28x faster GRPO training to reach same performance, 19.3% accuracy gain on AIME25, 6.2% average gain over six mathematics benchmarks. Enhanced out-of-domain generalization with 3.87% boost on GPQA-Diamond and 2.08% overall gain across 13 benchmarks.

Conclusion: Training with meta-cognitive guidance enhances both in-domain performance and out-of-domain generalization across logical, scientific, and coding domains, demonstrating that improved meta-awareness directly translates to better reasoning capabilities.

Abstract: Recent studies on reasoning models explore the meta-awareness of language models, the ability to know how to think by itself. We argue that large reasoning models lack this meta-awareness property by proving severe misalignment between true rollouts and predicted meta information. We posit that aligning meta-prediction with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and prove that enhanced meta-awareness directly translates to improved accuracy. Unlike existing meta-cognitive reasoning models, our method does not require external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are inspiring: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. More specifically, our method can speed up GRPO training by over 1.28x to reach the same performance, and achieve a 19.3% gain in accuracy on AIME25, and a 6.2 % average gain over six mathematics benchmarks. Training with meta-cognitive guidance enhances out-of-domain generalization, giving a 3.87 % boost on GPQA-Diamond and a 2.08 % overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.

[679] Semantic-Inductive Attribute Selection for Zero-Shot Learning

Juan Jose Herrera-Aranda, Guillermo Gomez-Trenado, Francisco Herrera, Isaac Triguero

Main category: cs.LG

TL;DR: This paper introduces a partitioning scheme for Zero-Shot Learning that simulates unseen conditions to assess attribute relevance without access to semantic information from unseen classes, and evaluates two feature-selection strategies that consistently improve accuracy on unseen classes by reducing redundancy.

DetailsMotivation: Semantic spaces in Zero-Shot Learning often contain noisy, redundant, or irrelevant attributes that hinder performance, especially in open-world scenarios where systems must adapt to new tasks dynamically.

Method: Proposed a partitioning scheme to simulate unseen conditions in inductive setting, and studied two feature-selection strategies: embedded feature selection adapted to ZSL (RFS) and evolutionary computation (GA) to explore attribute subsets.

Result: Experiments on five benchmark datasets (AWA2, CUB, SUN, aPY, FLO) show both methods consistently improve accuracy on unseen classes by reducing redundancy, with RFS being efficient but hyperparameter-dependent, and GA being more costly but exploring search space more broadly.

Conclusion: Semantic spaces are inherently redundant and the proposed partitioning scheme is an effective tool to refine them under inductive conditions, with complementary benefits from both feature-selection approaches.

Abstract: Zero-Shot Learning is an important paradigm within General-Purpose Artificial Intelligence Systems, particularly in those that operate in open-world scenarios where systems must adapt to new tasks dynamically. Semantic spaces play a pivotal role as they bridge seen and unseen classes, but whether human-annotated or generated by a machine learning model, they often contain noisy, redundant, or irrelevant attributes that hinder performance. To address this, we introduce a partitioning scheme that simulates unseen conditions in an inductive setting (which is the most challenging), allowing attribute relevance to be assessed without access to semantic information from unseen classes. Within this framework, we study two complementary feature-selection strategies and assess their generalisation. The first adapts embedded feature selection to the particular demands of ZSL, turning model-driven rankings into meaningful semantic pruning; the second leverages evolutionary computation to directly explore the space of attribute subsets more broadly. Experiments on five benchmark datasets (AWA2, CUB, SUN, aPY, FLO) show that both methods consistently improve accuracy on unseen classes by reducing redundancy, but in complementary ways: RFS is efficient and competitive though dependent on critical hyperparameters, whereas GA is more costly yet explores the search space more broadly and avoids such dependence. These results confirm that semantic spaces are inherently redundant and highlight the proposed partitioning scheme as an effective tool to refine them under inductive conditions.

[680] Distributed Area Coverage with High Altitude Balloons Using Multi-Agent Reinforcement Learning

Adam Haroon, Tristan Schuler

Main category: cs.LG

TL;DR: First systematic application of multi-agent reinforcement learning (MARL) to High Altitude Balloon coordination for distributed area coverage, achieving performance comparable to optimal geometric deterministic methods.

DetailsMotivation: Existing deterministic coordination methods perform poorly for smaller HAB teams and localized missions, while coordinated multi-agent reinforcement learning had not been investigated for HAB coordination despite single-agent RL success.

Method: Extended RLHAB simulation environment for multi-agent learning, adapted QMIX algorithm with Centralized Training with Decentralized Execution, using specialized observation spaces and hierarchical rewards prioritizing coverage while encouraging spatial distribution.

Result: QMIX achieves similar performance to theoretically optimal geometric deterministic method for distributed area coverage.

Conclusion: Validates MARL approach for HAB coordination and provides foundation for more complex autonomous multi-HAB missions where deterministic methods become intractable.

Abstract: High Altitude Balloons (HABs) can leverage stratospheric wind layers for limited horizontal control, enabling applications in reconnaissance, environmental monitoring, and communications networks. Existing multi-agent HAB coordination approaches use deterministic methods like Voronoi partitioning and extremum seeking control for large global constellations, which perform poorly for smaller teams and localized missions. While single-agent HAB control using reinforcement learning has been demonstrated on HABs, coordinated multi-agent reinforcement learning (MARL) has not yet been investigated. This work presents the first systematic application of multi-agent reinforcement learning (MARL) to HAB coordination for distributed area coverage. We extend our previously developed reinforcement learning simulation environment (RLHAB) to support cooperative multi-agent learning, enabling multiple agents to operate simultaneously in realistic atmospheric conditions. We adapt QMIX for HAB area coverage coordination, leveraging Centralized Training with Decentralized Execution to address atmospheric vehicle coordination challenges. Our approach employs specialized observation spaces providing individual state, environmental context, and teammate data, with hierarchical rewards prioritizing coverage while encouraging spatial distribution. We demonstrate that QMIX achieves similar performance to the theoretically optimal geometric deterministic method for distributed area coverage, validating the MARL approach and providing a foundation for more complex autonomous multi-HAB missions where deterministic methods become intractable.

[681] Data-Driven Temperature Modelling of Machine Tools by Neural Networks: A Benchmark

C. Coelho, M. Hohmann, D. Fernández, L. Penter, S. Ihlenfeldt, O. Niggemann

Main category: cs.LG

TL;DR: A novel neural network approach for predicting thermal errors in machine tools by modeling temperature and heat flux fields, enabling flexible error correction with reduced hardware requirements.

DetailsMotivation: Traditional thermal error compensation methods are limited in generality and adaptability, being tightly bound to specific error types, locations, or machine configurations.

Method: Train neural networks to predict high-fidelity temperature and heat flux fields using FEM data, with correlation-based sensor selection and benchmarking of various time-series architectures (RNN, GRU, LSTM, BiLSTM, Transformer, TCN).

Result: Accurate and low-cost prediction of temperature and heat flux fields achieved, enabling flexible thermal error correction with minimal hardware requirements.

Conclusion: The proposed framework provides a basis for generalizable thermal error correction in machine tools by predicting thermal fields rather than specific errors, allowing modular downstream error computation.

Abstract: Thermal errors in machine tools significantly impact machining precision and productivity. Traditional thermal error correction/compensation methods rely on measured temperature-deformation fields or on transfer functions. Most existing data-driven compensation strategies employ neural networks (NNs) to directly predict thermal errors or specific compensation values. While effective, these approaches are tightly bound to particular error types, spatial locations, or machine configurations, limiting their generality and adaptability. In this work, we introduce a novel paradigm in which NNs are trained to predict high-fidelity temperature and heat flux fields within the machine tool. The proposed framework enables subsequent computation and correction of a wide range of error types using modular, swappable downstream components. The NN is trained using data obtained with the finite element method under varying initial conditions and incorporates a correlation-based selection strategy that identifies the most informative measurement points, minimising hardware requirements during inference. We further benchmark state-of-the-art time-series NN architectures, namely Recurrent NN, Gated Recurrent Unit, Long-Short Term Memory (LSTM), Bidirectional LSTM, Transformer, and Temporal Convolutional Network, by training both specialised models, tailored for specific initial conditions, and general models, capable of extrapolating to unseen scenarios. The results show accurate and low-cost prediction of temperature and heat flux fields, laying the basis for enabling flexible and generalisable thermal error correction in machine tool environments.

[682] Rethinking Inter-LoRA Orthogonality in Adapter Merging: Insights from Orthogonal Monte Carlo Dropout

Andi Zhang, Xuan Ding, Haofan Wang, Steven McDonagh, Samuel Kaski

Main category: cs.LG

TL;DR: Orthogonal Monte Carlo Dropout enforces strict orthogonality in LoRA merging to prevent interference between semantic vectors, but empirical results show this orthogonality doesn’t achieve semantic compositionality.

DetailsMotivation: When multiple LoRAs are merged (e.g., object + style), their semantic vectors interfere with each other, which Orthogonal Monte Carlo Dropout aims to prevent through enforced orthogonality.

Method: Orthogonal Monte Carlo Dropout mechanism that enforces strict orthogonality when combining sparse semantic vectors without extra time complexity, guaranteeing theoretical and runtime orthogonality.

Result: Empirical analysis reveals that enforced orthogonality does not lead to semantic disentanglement or compositionality, suggesting orthogonality alone is insufficient for true semantic compositionality.

Conclusion: Inter-LoRA orthogonality alone may be insufficient for achieving semantic compositionality, prompting re-examination of its role in adapter merging.

Abstract: We propose Orthogonal Monte Carlo Dropout, a mechanism that enforces strict orthogonality when combining sparse semantic vectors without extra time complexity. LoRA, a popular fine-tuning method for large models, typically trains a module to represent a specific concept such as an object or a style. When multiple LoRAs are merged, for example to generate an object in a particular style, their semantic vectors may interfere with each other. Our method guarantees, at the theoretical and runtime levels, that merged LoRAs remain orthogonal and thus free from direct interference. However, empirical analysis reveals that such orthogonality does not lead to the semantic disentanglement or compositionality highlighted in prior work on compositional adaptation. This finding suggests that inter-LoRA orthogonality alone may be insufficient for achieving true semantic compositionality, prompting a re-examination of its role in adapter merging.

[683] Memory Self-Regeneration: Uncovering Hidden Knowledge in Unlearned Models

Agnieszka Polowczyk, Alicja Polowczyk, Joanna Waczyńska, Piotr Borycki, Przemysław Spurek

Main category: cs.LG

TL;DR: The paper addresses machine unlearning challenges in text-to-image models, introduces Memory Self-Regeneration task and MemoRa strategy for knowledge recovery, and identifies two types of forgetting: short-term and long-term.

DetailsMotivation: Modern text-to-image models can be misused to create harmful content, accelerating the need for machine unlearning to selectively remove specific knowledge while maintaining overall performance.

Method: Introduces Memory Self-Regeneration task and MemoRa strategy as a regenerative approach for recovering lost knowledge. Proposes robustness in knowledge retrieval as a key evaluation measure.

Result: Demonstrates that forgetting occurs in two distinct ways: short-term (quick recall) and long-term (challenging recovery). Shows models can still generate unlearned concepts through adversarial prompts.

Conclusion: Robustness in knowledge retrieval is crucial for developing effective unlearning techniques, and understanding the distinction between short-term and long-term forgetting is essential for machine unlearning research.

Abstract: The impressive capability of modern text-to-image models to generate realistic visuals has come with a serious drawback: they can be misused to create harmful, deceptive or unlawful content. This has accelerated the push for machine unlearning. This new field seeks to selectively remove specific knowledge from a model’s training data without causing a drop in its overall performance. However, it turns out that actually forgetting a given concept is an extremely difficult task. Models exposed to attacks using adversarial prompts show the ability to generate so-called unlearned concepts, which can be not only harmful but also illegal. In this paper, we present considerations regarding the ability of models to forget and recall knowledge, introducing the Memory Self-Regeneration task. Furthermore, we present MemoRa strategy, which we consider to be a regenerative approach supporting the effective recovery of previously lost knowledge. Moreover, we propose that robustness in knowledge retrieval is a crucial yet underexplored evaluation measure for developing more robust and effective unlearning techniques. Finally, we demonstrate that forgetting occurs in two distinct ways: short-term, where concepts can be quickly recalled, and long-term, where recovery is more challenging.

[684] Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Syeda Nahida Akter, Shrimai Prabhumoye, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Yejin Choi, Bryan Catanzaro

Main category: cs.LG

TL;DR: Front-loading reasoning data during pretraining yields 19% average performance gains that cannot be replicated by later-stage SFT, with pretraining benefiting from diverse reasoning patterns and SFT being more sensitive to data quality.

DetailsMotivation: To understand the role of reasoning data in different training stages (pretraining vs post-training) and determine optimal data allocation strategies, given the opacity of current frontier models' training practices.

Method: Systematic study of how reasoning data (varying in scale, diversity, and quality) affects LLM performance when introduced at different training stages, comparing pretraining vs SFT approaches.

Result: Pretraining with reasoning data provides 19% average gain that SFT cannot replicate; pretraining benefits from diverse reasoning patterns (11% gain) while SFT is more sensitive to data quality (15% gain); high-quality pretraining data has latent effects activated after SFT.

Conclusion: Early reasoning data injection during pretraining establishes foundational capabilities that cannot be recovered by later fine-tuning, challenging the conventional separation of language modeling and reasoning and providing principled data allocation guidance.

Abstract: The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly incorporated also during the mid-training stage-a practice that is relatively more proprietary and less openly characterized-the role of such data in pretraining remains unclear. In particular, due to the opaqueness of pretraining corpora in most frontier models, the effect of reasoning data introduced at different phases of pre- and/or post-training is relatively less reported in the scientific literature. This raises several important questions: Is adding reasoning data earlier during pretraining any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data-varying in scale, diversity, and quality-affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% avg gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% avg gain), while SFT is more sensitive to data quality (15% avg gain). We show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.

[685] MindCraft: How Concept Trees Take Shape In Deep Models

Bowei Tian, Yexiao He, Wanghao Ye, Ziyao Wang, Meng Liu, Ang Li

Main category: cs.LG

TL;DR: The MindCraft framework introduces Concept Trees using spectral decomposition to visualize how concepts hierarchically emerge and separate in foundation models across multiple domains.

DetailsMotivation: To understand how large-scale foundation models internally structure and stabilize concepts, which remains elusive despite their strong performance.

Method: Built Concept Trees using spectral decomposition at each layer to link principal directions into branching Concept Paths, revealing hierarchical concept emergence.

Result: Concept Trees successfully recover semantic hierarchies, disentangle latent concepts, and can be applied across medical diagnosis, physics reasoning, and political decision-making domains.

Conclusion: Concept Trees provide a widely applicable framework for in-depth analysis of conceptual representations in deep models, advancing interpretable AI foundations.

Abstract: Large-scale foundation models demonstrate strong performance across language, vision, and reasoning tasks. However, how they internally structure and stabilize concepts remains elusive. Inspired by causal inference, we introduce the MindCraft framework built upon Concept Trees. By applying spectral decomposition at each layer and linking principal directions into branching Concept Paths, Concept Trees reconstruct the hierarchical emergence of concepts, revealing exactly when they diverge from shared representations into linearly separable subspaces. Empirical evaluations across diverse scenarios across disciplines, including medical diagnosis, physics reasoning, and political decision-making, show that Concept Trees recover semantic hierarchies, disentangle latent concepts, and can be widely applied across multiple domains. The Concept Tree establishes a widely applicable and powerful framework that enables in-depth analysis of conceptual representations in deep models, marking a significant step forward in the foundation of interpretable AI.

[686] Variational Autoencoders-based Detection of Extremes in Plant Productivity in an Earth System Model

Bharat Sharma, Jitendra Kumar

Main category: cs.LG

TL;DR: VAE-based anomaly detection for GPP extremes shows comparable performance to traditional SSA methods, with increasing carbon cycle extremes projected for 2050-80, particularly in Western and Central North America.

DetailsMotivation: Climate anomalies significantly impact terrestrial carbon cycle dynamics, necessitating robust methods for detecting and analyzing anomalous behavior in plant productivity.

Method: Variational autoencoders (VAE) with three dense layers and 12-month input sequences trained on normalized GPP time series, compared against traditional singular spectral analysis (SSA) across three time periods (1850-80, 1950-80, 2050-80) under SSP585 scenario.

Result: Strong regional agreement between VAE and SSA in spatial patterns, with VAE producing higher threshold values (179-756 GgC vs 100-784 GgC). Both methods show increasing magnitudes and frequencies of negative carbon cycle extremes toward 2050-80, especially in Western and Central North America.

Conclusion: VAE approach shows comparable performance to SSA while offering computational advantages and enhanced capability for capturing non-linear temporal dependencies without requiring predefined signal periodicity.

Abstract: Climate anomalies significantly impact terrestrial carbon cycle dynamics, necessitating robust methods for detecting and analyzing anomalous behavior in plant productivity. This study presents a novel application of variational autoencoders (VAE) for identifying extreme events in gross primary productivity (GPP) from Community Earth System Model version 2 simulations across four AR6 regions in the Continental United States. We compare VAE-based anomaly detection with traditional singular spectral analysis (SSA) methods across three time periods: 1850-80, 1950-80, and 2050-80 under the SSP585 scenario. The VAE architecture employs three dense layers and a latent space with an input sequence length of 12 months, trained on a normalized GPP time series to reconstruct the GPP and identifying anomalies based on reconstruction errors. Extreme events are defined using 5th percentile thresholds applied to both VAE and SSA anomalies. Results demonstrate strong regional agreement between VAE and SSA methods in spatial patterns of extreme event frequencies, despite VAE producing higher threshold values (179-756 GgC for VAE vs. 100-784 GgC for SSA across regions and periods). Both methods reveal increasing magnitudes and frequencies of negative carbon cycle extremes toward 2050-80, particularly in Western and Central North America. The VAE approach shows comparable performance to established SSA techniques, while offering computational advantages and enhanced capability for capturing non-linear temporal dependencies in carbon cycle variability. Unlike SSA, the VAE method does not require one to define the periodicity of the signals in the data; it discovers them from the data.

[687] PT$^2$-LLM: Post-Training Ternarization for Large Language Models

Xianglong Yan, Chengzhu Bao, Zhiteng Li, Tianao Zhang, Kaicheng Yang, Haotong Qin, Ruobing Xie, Xingwu Sun, Yulun Zhang

Main category: cs.LG

TL;DR: PT^2-LLM is a post-training ternarization framework for LLMs that achieves competitive performance with 2-bit quantization through iterative ternary fitting, activation-aware grid alignment, and structural similarity-based reordering.

DetailsMotivation: LLMs have large memory and compute demands that hinder deployment. Ternarization offers substantial size reduction and efficiency, but its potential in post-training quantization remains underexplored due to challenges with training-free parameter optimization and quantization difficulties from outliers and dispersed weights.

Method: Proposes PT^2-LLM with Asymmetric Ternary Quantizer featuring: (1) Iterative Ternary Fitting that alternates between optimal ternary grid construction and flexible rounding, (2) Activation-aware Grid Alignment that refines the ternary grid to match full-precision outputs, and (3) Structural Similarity-based Reordering that leverages inter-column similarity to ease quantization and mitigate outlier effects.

Result: Extensive experiments show PT^2-LLM delivers competitive performance against state-of-the-art 2-bit PTQ methods with lower memory cost, while accelerating both prefill and decoding for end-to-end speedup.

Conclusion: PT^2-LLM provides an effective post-training ternarization framework that achieves efficient LLM compression with competitive performance and practical speed improvements.

Abstract: Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. Ternarization has gained attention as a promising compression technique, delivering substantial size reduction and high computational efficiency. However, its potential in the post-training quantization (PTQ) setting remains underexplored, due to the challenge of training-free parameter optimization and the quantization difficulty posed by outliers and dispersed weights. To address these issues, we propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline: (1) Iterative Ternary Fitting (ITF), which alternates between optimal ternary grid construction and flexible rounding to minimize quantization error, and (2) Activation-aware Grid Alignment (AGA), which further refines the ternary grid to better match full-precision outputs. In addition, we propose a plug-and-play Structural Similarity-based Reordering (SSR) strategy that leverages inter-column structural similarity to ease quantization and mitigate outlier effects, further enhancing overall performance. Extensive experiments demonstrate that PT$^2$-LLM delivers competitive performance against state-of-the-art (SOTA) 2-bit PTQ methods with lower memory cost, while also accelerating both prefill and decoding to achieve end-to-end speedup. The code and models will be available at https://github.com/XIANGLONGYAN/PT2-LLM.

[688] Decrypt Modality Gap in Multimodal Contrastive Learning: From Convergent Representation to Pair Alignment

Lingjie Yi, Raphael Douady, Chao Chen

Main category: cs.LG

TL;DR: This paper provides the first theoretical framework explaining the modality gap in multimodal contrastive learning, identifying dimension collapse as the fundamental cause and showing how it affects downstream performance through sample alignment.

DetailsMotivation: Empirical evidence shows representations from different modalities occupy separate regions (modality gap), but inconsistent findings exist about how this gap influences downstream performance. The paper aims to understand what causes the modality gap and how it affects downstream tasks.

Method: The authors introduce a theoretical framework for analyzing convergent optimal representations in MCL and modality alignment. They prove convergence properties under different constraints: no constraint, cone constraint, and subspace constraint (where dimension collapse occurs).

Result: Without constraints or under cone constraint, modality gap converges to zero. Under subspace constraint (due to dimension collapse), gap converges to smallest angle between hyperplanes. Dimension collapse is identified as the fundamental origin of modality gap. Perfect alignment can be achieved via hyperplane rotation or shared space projection.

Conclusion: Dimension collapse causes the modality gap, which affects downstream performance by influencing sample pair alignment. Perfect alignment between modalities is still achievable through hyperplane rotation or shared space projection despite the subspace constraint.

Abstract: Multimodal contrastive learning (MCL) aims to embed data from different modalities in a shared embedding space. However, empirical evidence shows that representations from different modalities occupy completely separate regions of embedding space, a phenomenon referred to as the modality gap. Moreover, experimental findings on how the size of the modality gap influences downstream performance are inconsistent. These observations raise two key questions: (1) What causes the modality gap? (2) How does it affect downstream tasks? To address these questions, this paper introduces the first theoretical framework for analyzing the convergent optimal representations of MCL and the modality alignment when training is optimized. Specifically, we prove that without any constraint or under the cone constraint, the modality gap converges to zero. Under the subspace constraint (i.e., representations of two modalities fall into two distinct hyperplanes due to dimension collapse), the modality gap converges to the smallest angle between the two hyperplanes. This result identifies \emph{dimension collapse} as the fundamental origin of the modality gap. Furthermore, our theorems demonstrate that paired samples cannot be perfectly aligned under the subspace constraint. The modality gap influences downstream performance by affecting the alignment between sample pairs. We prove that, in this case, perfect alignment between two modalities can still be achieved via two ways: hyperplane rotation and shared space projection.

[689] General Exploratory Bonus for Optimistic Exploration in RLHF

Wendi Li, Changdae Oh, Yixuan Li

Main category: cs.LG

TL;DR: The paper introduces General Exploratory Bonus (GEB), a novel framework that addresses the failure of existing exploratory bonus methods to achieve optimistic exploration in RLHF by counteracting divergence-induced bias.

DetailsMotivation: Current exploratory bonus methods using KL or α-divergence regularization unintentionally bias exploration toward high-probability regions of the reference model, reinforcing conservative behavior instead of promoting discovery of uncertain regions.

Method: Proposed General Exploratory Bonus (GEB) framework that counteracts divergence-induced bias via reference-dependent reward regulation, unifying prior heuristic bonuses as special cases across the full α-divergence family.

Result: GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones, demonstrating improved optimistic exploration.

Conclusion: GEB offers both a principled and practical solution for optimistic exploration in RLHF, addressing the theoretical limitations of existing methods while achieving superior empirical performance.

Abstract: Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.

[690] CoDA: Coding LM via Diffusion Adaptation

Haolin Chen, Shiyu Wang, Can Qin, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao

Main category: cs.LG

TL;DR: CoDA is a 1.7B-parameter diffusion language model for code that achieves competitive performance with larger models through efficient training and inference techniques.

DetailsMotivation: Diffusion language models offer bidirectional context and infilling capabilities that autoregressive models lack, but existing systems are too heavyweight for practical use.

Method: Uses large-scale diffusion pre-training with code-centric mid-training and instruction tuning, plus confidence-guided sampling to maintain competitive inference latency.

Result: CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters on Humaneval, MBPP, and EvalPlus benchmarks.

Conclusion: The release includes model checkpoints, evaluation tools, and TPU training pipelines to advance research on lightweight diffusion-based coding assistants.

Abstract: Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On Humaneval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.

[691] Decision Potential Surface: A Theoretical and Practical Approximation of LLM’s Decision Boundary

Zi Liang, Zhiyao Wu, Haoyang Shang, Yulin Jin, Qingqing Ye, Huadi Zheng, Peizhao Hu, Haibo Hu

Main category: cs.LG

TL;DR: Proposes Decision Potential Surface (DPS) as a computationally feasible alternative to directly constructing decision boundaries for large language models, enabling analysis through finite sampling.

DetailsMotivation: Direct construction of decision boundaries for LLMs is computationally infeasible due to enormous vocabulary-sequence sizes and auto-regressive nature, limiting model analysis and interpretation.

Method: Defines Decision Potential Surface (DPS) based on confidence in distinguishing sampling sequences, proves equivalence to decision boundary, and develops K-DPS algorithm using K-finite sampling to approximate boundaries with bounded error.

Result: Theoretical derivation of error bounds shows approximation errors can be traded off with sampling times, with empirical validation across various LLMs and corpora demonstrating practical feasibility.

Conclusion: DPS provides a computationally tractable framework for analyzing LLM decision boundaries through finite sampling, enabling previously infeasible model analysis with controllable approximation errors.

Abstract: Decision boundary, the subspace of inputs where a machine learning model assigns equal classification probabilities to two classes, is pivotal in revealing core model properties and interpreting behaviors. While analyzing the decision boundary of large language models (LLMs) has raised increasing attention recently, constructing it for mainstream LLMs remains computationally infeasible due to the enormous vocabulary-sequence sizes and the auto-regressive nature of LLMs. To address this issue, in this paper we propose Decision Potential Surface (DPS), a new notion for analyzing LLM decision boundary. DPS is defined on the confidences in distinguishing different sampling sequences for each input, which naturally captures the potential of decision boundary. We prove that the zero-height isohypse in DPS is equivalent to the decision boundary of an LLM, with enclosed regions representing decision regions. By leveraging DPS, for the first time in the literature, we propose an approximate decision boundary construction algorithm, namely $K$-DPS, which only requires K-finite times of sequence sampling to approximate an LLM’s decision boundary with negligible error. We theoretically derive the upper bounds for the absolute error, expected error, and the error concentration between K-DPS and the ideal DPS, demonstrating that such errors can be trade-off with sampling times. Our results are empirically validated by extensive experiments across various LLMs and corpora.

[692] PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling

Yukun Zhang, Xueqing Zhou

Main category: cs.LG

TL;DR: This paper presents a theoretical framework that models the Transformer architecture as a continuous spatiotemporal dynamical system governed by a PDE, revealing that residual connections and layer normalization are mathematically necessary stabilization mechanisms rather than heuristic tricks.

DetailsMotivation: To develop a principled theoretical understanding of the Transformer's internal mechanisms, which currently lacks rigorous mathematical explanation despite its revolutionary impact on AI.

Method: The authors introduce a novel analytical framework that maps the Transformer’s discrete layered structure to a continuous PDE system, where self-attention becomes a non-local interaction operator, feed-forward networks become local reactions, and residual connections/layer normalization are modeled as stabilization mechanisms. They use this PDE system as a theoretical probe to analyze mathematical necessity.

Result: Experiments comparing standard Transformers with PDE simulators lacking explicit stabilizers show that without residual connections, the system suffers catastrophic representational drift, and without layer normalization, training becomes unstable and explosive.

Conclusion: Residual connections and layer normalization are fundamental mathematical stabilizers required to control an otherwise powerful but inherently unstable continuous system, providing a first-principles explanation for the Transformer’s design and establishing a new paradigm for analyzing deep neural networks through continuous dynamics.

Abstract: The Transformer architecture has revolutionized artificial intelligence, yet a principled theoretical understanding of its internal mechanisms remains elusive. This paper introduces a novel analytical framework that reconceptualizes the Transformer’s discrete, layered structure as a continuous spatiotemporal dynamical system governed by a master Partial Differential Equation (PDE). Within this paradigm, we map core architectural components to distinct mathematical operators: self-attention as a non-local interaction, the feed-forward network as a local reaction, and, critically, residual connections and layer normalization as indispensable stabilization mechanisms. We do not propose a new model, but rather employ the PDE system as a theoretical probe to analyze the mathematical necessity of these components. By comparing a standard Transformer with a PDE simulator that lacks explicit stabilizers, our experiments provide compelling empirical evidence for our central thesis. We demonstrate that without residual connections, the system suffers from catastrophic representational drift, while the absence of layer normalization leads to unstable, explosive training dynamics. Our findings reveal that these seemingly heuristic “tricks” are, in fact, fundamental mathematical stabilizers required to tame an otherwise powerful but inherently unstable continuous system. This work offers a first-principles explanation for the Transformer’s design and establishes a new paradigm for analyzing deep neural networks through the lens of continuous dynamics.

[693] Learning without Global Backpropagation via Synergistic Information Distillation

Chenhao Ye, Ming Tang

Main category: cs.LG

TL;DR: Synergistic Information Distillation (SID) is a novel training framework that addresses BP’s scalability bottlenecks by reframing deep learning as local cooperative refinement problems, enabling parallel training while preserving standard inference.

DetailsMotivation: To overcome BP's update locking (network modules idle during backward pass) and high memory consumption from storing activations for gradient computation.

Method: Structures deep network as pipeline of modules with local objectives to refine probabilistic beliefs about targets, balancing target fidelity with consistency to preceding module’s belief, thus decoupling backward dependencies.

Result: Eliminates update locking, drastically reduces memory requirements, guarantees monotonic performance improvement with depth, and matches/surpasses BP classification accuracy with superior scalability and robustness to label noise.

Conclusion: SID serves as a versatile drop-in replacement for BP that maintains standard feed-forward inference while solving BP’s scalability issues through local cooperative refinement.

Abstract: Backpropagation (BP), while foundational to deep learning, imposes two critical scalability bottlenecks: update locking, where network modules remain idle until the entire backward pass completes, and high memory consumption due to storing activations for gradient computation. To address these limitations, we introduce Synergistic Information Distillation (SID), a novel training framework that reframes deep learning as a cascade of local cooperative refinement problems. In SID, a deep network is structured as a pipeline of modules, each imposed with a local objective to refine a probabilistic belief about the ground-truth target. This objective balances fidelity to the target with consistency to the belief from its preceding module. By decoupling the backward dependencies between modules, SID enables parallel training and hence eliminates update locking and drastically reduces memory requirements. Meanwhile, this design preserves the standard feed-forward inference pass, making SID a versatile drop-in replacement for BP. We provide a theoretical foundation, proving that SID guarantees monotonic performance improvement with network depth. Empirically, SID consistently matches or surpasses the classification accuracy of BP, exhibiting superior scalability and pronounced robustness to label noise.Code is available at: https://github.com/ychAlbert/sid-bp

[694] Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models

Tianao Zhang, Zhiteng Li, Xianglong Yan, Haotong Qin, Yong Guo, Yulun Zhang

Main category: cs.LG

TL;DR: Quant-dLLM is a post-training quantization framework for diffusion large language models that enables effective 2-bit weight compression through masked calibration simulation, data-aware quantization, and adaptive mixed precision allocation.

DetailsMotivation: Diffusion LLMs are emerging as alternatives to autoregressive LLMs but face similar model size growth issues. Standard PTQ methods fail at 2-bit precision for dLLMs due to their unique masked-denoising activations that differ from fully visible signals in AR LLMs.

Method: Three key components: 1) Masked Calibration Simulation (MCS) to align calibration with timestep-dependent masking, 2) Data-aware Any-order Quantizer (DAQ) that learns ultra-low-bit representations via optimization, and 3) Adaptive Blockwise Mixed Precision (ABMP) for sensitivity-based bit allocation across channel groups.

Result: Quant-dLLM consistently achieves higher accuracy than state-of-the-art AR-transfer PTQ methods when restricted to 2-bit precision on dLLMs.

Conclusion: The proposed framework successfully addresses the unique challenges of quantizing diffusion LLMs and enables effective ultra-low-bit deployment while maintaining performance.

Abstract: Diffusion large language models (dLLMs), which offer bidirectional context and flexible masked-denoising generation, are emerging as a compelling alternative to autoregressive (AR) LLMs. However, like AR LLMs, their model sizes continue to grow, motivating weight compression for deployment. Although post-training quantization (PTQ) is effective for AR LLMs, directly transferring it to dLLMs at 2-bit leads to unsatisfactory performance. To tackle these challenges, we propose Quant-dLLM, an ultra-low-bit PTQ framework tailored to dLLMs. Since masked-denoising activations in dLLMs differ from the fully visible signals assumed by standard PTQ methods, we introduce Masked Calibration Simulation (MCS) to align calibration with the timestep-dependent masking, which yields more reliable calibrations. Moreover, we propose a Data-aware Any-order Quantizer (DAQ) that learns ultra-low-bit weight representations via an optimization algorithm. It performs iterative approximation guided by our simulated calibration data. In addition, under a strict 2-bit budget, we introduce Adaptive Blockwise Mixed Precision (ABMP), a sensitivity-based precision allocation scheme that adaptively assigns bit width across channel groups. When restricted to 2-bit precision, Quant-dLLM consistently achieves higher accuracy than state-of-the-art (SOTA) AR-transfer PTQ methods on dLLMs. The code and models will be available at: https://github.com/ZTA2785/Quant-dLLM.

[695] SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size

Junhao Xia, Ming Zhao, Limin Xiao, Xiujun Zhang

Main category: cs.LG

TL;DR: SDQ-LLM is a novel framework for 1-bit quantization of large language models using Sigma-Delta quantization with adjustable Over-Sampling Ratio, enabling efficient deployment while preserving reasoning capabilities.

DetailsMotivation: LLMs face significant computational and memory challenges, making extremely low-bit quantization crucial for efficient deployment while maintaining linguistic reasoning capabilities.

Method: Uses upsampling with Sigma-Delta Quantizer to binarize/ternarize weights, Hadamard-based weight smoothing, and MultiOSR strategy for layer-wise OSR allocation based on weight variance and parameter scale.

Result: Extensive experiments on OPT and LLaMA model families show SDQ-LLM achieves efficient and high-precision performance even under aggressive low-OSR settings.

Conclusion: SDQ-LLM enables extremely low-bit quantization of LLMs with continuous OSR adjustability, providing optimal trade-off between model size and accuracy for efficient deployment.

Abstract: Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size, a novel framework that enables extremely low-bit quantization of LLMs while preserving their linguistic reasoning capabilities. A distinctive feature of SDQ-LLM is the continuous adjustability of the Over-Sampling Ratio (OSR), enabling dynamic adaptation to memory or VRAM constraints by selecting fractional OSR (e.g. 2.5 times) for an optimal trade-off between model size and accuracy. SDQ-LLM uses upsampling combined with Sigma-Delta Quantizer to binarize or ternarize LLMs weights, encoding high-precision parameters into 1-bit or 1.58-bit representations, replacing the multiplication operations within linear layers with addition. This approach significantly enhances inference efficiency under extremely low-bit quantization. To further reduce the loss of quantization precision, we incorporate Hadamard-based weight smoothing prior to quantization, improving the stability and robustness of the weight representations. Furthermore, to fully leverage the continuity of the OSR and reduce precision loss, recognizing the correlation between quantization sensitivity and weight variance, we propose a fine-grained, layer- and linear-wise OSR allocation strategy, MultiOSR. This strategy distributes OSR both across layers and within each layer, based on weight variance and parameter scale. Finally, extensive experiments on OPT and LLaMA model families demonstrate that SDQ-LLM achieves a more efficient and high-precision performance even under highly aggressive low-OSR settings. Our code is available at https://github.com/Dreamlittlecat/LLM-Quant-Factory.

[696] QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks

Qian Chen, Linxin Yang, Akang Wang, Xiaodong Luo, Yin Zhang

Main category: cs.LG

TL;DR: The paper proposes a lightweight quadratic enhancer that introduces quadratic transformations in neural networks to increase nonlinearity, using low-rankness, weight sharing, and sparsification to minimize parameter and computational overhead.

DetailsMotivation: To enhance the performance of existing deep neural network architectures by increasing nonlinearity through quadratic transformations, while maintaining computational efficiency.

Method: A lightweight quadratic enhancer that introduces quadratic interactions between features at every layer using low-rankness, weight sharing, and sparsification techniques to minimize additional parameters and computations.

Result: Substantial performance gains across three tasks: image classification, text classification, and fine-tuning large-language models, with negligible additional model parameters and forward computations.

Conclusion: The proposed quadratic enhancer effectively improves neural network performance across multiple domains while maintaining computational efficiency through careful design choices.

Abstract: The combination of linear transformations and non-linear activation functions forms the foundation of most modern deep neural networks, enabling them to approximate highly complex functions. This paper explores the introduction of quadratic transformations to further increase nonlinearity in neural networks, with the aim of enhancing the performance of existing architectures. To reduce parameter complexity and computational complexity, we propose a lightweight quadratic enhancer that uses low-rankness, weight sharing, and sparsification techniques. For a fixed architecture, the proposed approach introduces quadratic interactions between features at every layer, while only adding negligible amounts of additional model parameters and forward computations. We conduct a set of proof-of-concept experiments for the proposed method across three tasks: image classification, text classification, and fine-tuning large-language models. In all tasks, the proposed approach demonstrates clear and substantial performance gains.

[697] Quantifying constraint hierarchies in Bayesian PINNs via per-constraint Hessian decomposition

Filip Landgren

Main category: cs.LG

TL;DR: A scalable Laplace framework is introduced to analyze how physical constraints affect uncertainty and curvature in Bayesian physics-informed neural networks (B-PINNs), showing how individual constraints shape the loss landscape.

DetailsMotivation: To clarify how physical constraints affect uncertainty interpretation in B-PINNs, addressing concerns about overconfidence and the poorly understood effects of constraints on network behavior.

Method: A scalable, matrix-free Laplace framework that decomposes the posterior Hessian into contributions from each constraint, providing metrics to quantify their relative influence on the loss landscape.

Result: Applied to the Van der Pol equation, the method tracks how constraints sculpt network geometry and shows how changing a single loss weight non-trivially redistributes curvature and effective dominance across other constraints.

Conclusion: The framework enables better understanding of how physical constraints shape B-PINNs’ uncertainty and curvature, addressing interpretation challenges in these networks.

Abstract: Bayesian physics-informed neural networks (B-PINNs) merge data with governing equations to solve differential equations under uncertainty. However, interpreting uncertainty and overconfidence in B-PINNs requires care due to the poorly understood effects the physical constraints have on the network; overconfidence could reflect warranted precision, enforced by the constraints, rather than miscalibration. Motivated by the need to further clarify how individual physical constraints shape these networks, we introduce a scalable, matrix-free Laplace framework that decomposes the posterior Hessian into contributions from each constraint and provides metrics to quantify their relative influence on the loss landscape. Applied to the Van der Pol equation, our method tracks how constraints sculpt the network’s geometry and shows, directly through the Hessian, how changing a single loss weight non-trivially redistributes curvature and effective dominance across the others.

[698] Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques

Main category: cs.LG

TL;DR: Self-RedTeam is an online self-play RL algorithm that treats safety alignment as a two-player zero-sum game, enabling continuous co-evolution of attacker and defender agents within a single model to achieve dynamic safety improvements.

DetailsMotivation: To overcome the reactive, disjoint nature of conventional safety alignment where attackers exploit static models and defenders perpetually lag behind emerging threats.

Method: Uses self-play reinforcement learning where a single model alternates between attacker and defender roles, with a reward LM adjudicating outcomes. Includes hidden Chain-of-Thought for private planning.

Result: Achieves 21.8% increase in adversarial diversity and 65.5% higher robustness on safety benchmarks compared to static approaches. Reduces over-refusals and enables scalable autonomous improvement.

Conclusion: Proposes shifting from reactive patching to proactive co-evolution in LM safety training, enabling robust self-improvement via multi-agent reinforcement learning.

Abstract: Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch – attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles – generating adversarial prompts and safeguarding against them – while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).

[699] MemMamba: Rethinking Memory Patterns in State Space Model

Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun

Main category: cs.LG

TL;DR: MemMamba addresses Mamba’s long-range memory decay by integrating state summarization and cross-layer attention, achieving better performance on long sequences while maintaining linear complexity.

DetailsMotivation: Existing methods for long-sequence modeling face trade-offs between efficiency and memory - RNNs have gradient issues, Transformers have quadratic complexity, and Mamba has exponential memory decay. There's a need for models that can handle ultra-long sequences efficiently.

Method: Proposed MemMamba framework with state summarization mechanism and cross-layer/cross-token attention to alleviate long-range forgetting while preserving linear complexity. Introduced horizontal-vertical memory fidelity metrics to quantify information loss.

Result: Significant improvements over Mamba variants and Transformers on PG19 and Passkey Retrieval benchmarks, with 48% inference speedup. Achieves breakthrough in complexity-memory trade-off.

Conclusion: MemMamba offers a new paradigm for ultra-long sequence modeling by solving the memory decay problem while maintaining efficiency, demonstrated through both theoretical analysis and empirical results.

Abstract: With the explosive growth of data, long-sequence modeling has become increasingly important in tasks such as natural language processing and bioinformatics. However, existing methods face inherent trade-offs between efficiency and memory. Recurrent neural networks suffer from gradient vanishing and explosion, making them hard to scale. Transformers can model global dependencies but are constrained by quadratic complexity. Recently, selective state-space models such as Mamba have demonstrated high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. In this work, we conduct mathematical derivations and information-theoretic analysis to systematically uncover the memory decay mechanism of Mamba, answering a fundamental question: what is the nature of Mamba’s long-range memory and how does it retain information? To quantify key information loss, we further introduce horizontal-vertical memory fidelity metrics that capture degradation both within and across layers. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba, a novel architectural framework that integrates state summarization mechanism together with cross-layer and cross-token attention, which alleviates long-range forgetting while preserving linear complexity. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks such as PG19 and Passkey Retrieval, while delivering a 48% speedup in inference efficiency. Both theoretical analysis and empirical results demonstrate that MemMamba achieves a breakthrough in the complexity-memory trade-off, offering a new paradigm for ultra-long sequence modeling.

[700] GUIDE: Towards Scalable Advising for Research Ideas

Yaowenqi Liu, Bingxu Meng, Rui Pan, Yuxing Liu, Jerry Huang, Jiaxuan You, Tong Zhang

Main category: cs.LG

TL;DR: A small model with compressed literature database and structured reasoning framework outperforms large models like Deepseek-R1 in ICLR 2025 paper evaluation, achieving over 90% acceptance rate for high-confidence predictions.

DetailsMotivation: Address the gap in scalable advising systems for providing high-quality feedback on hypotheses and experimental designs in AI research.

Method: Explore key factors including model size, context length, confidence estimation, and structured reasoning processes. Use a small model with compressed literature database and structured reasoning framework.

Result: The system outperforms Deepseek-R1 in acceptance rates for self-ranked top-30% ICLR 2025 submissions. With high-confidence predictions, achieves over 90% acceptance rate on ICLR 2025 test set.

Conclusion: The approach significantly enhances quality and efficiency of hypothesis generation and experimental design, demonstrating that well-designed small models can outperform larger general-purpose models in specific research advising tasks.

Abstract: The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as Deepseek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design. The code is released at https://github.com/HowardLiu0830/GUIDE-Research-Idea-Evaluation.

[701] Training Optimal Large Diffusion Language Models

Jinjie Ni, Qian Liu, Chao Du, Longxu Dou, Hang Yan, Zili Wang, Tianyu Pang, Michael Qizhe Shieh

Main category: cs.LG

TL;DR: Quokka introduces the first systematic scaling law for diffusion language models (DLMs), covering both compute-constrained and data-constrained regimes, and studying key modeling and optimization designs.

DetailsMotivation: To provide practical guidance for DLM training and inspire the broader AI community by establishing systematic scaling laws similar to Chinchilla but with wider scope.

Method: Developed Quokka scaling law framework that encompasses both compute-constrained and data-constrained regimes, analyzing key modeling and optimization designs for diffusion language models.

Result: Created the first systematic scaling law specifically for diffusion language models, extending beyond Chinchilla’s scope to provide comprehensive guidance for DLM training.

Conclusion: Quokka serves as a valuable framework that offers both immediate practical guidance for DLM training and long-term inspiration for the AI community, positioning itself as a companion to Chinchilla with broader applicability.

Abstract: We introduce Quokka, the first systematic scaling law for diffusion language models (DLMs), encompassing both compute-constrained and data-constrained regimes, and studying the key modeling and optimization designs. Quokka is a good friend of Chinchilla and provides wider scopes. We hope the results would bring short-term practical guidance in DLMs training and long-term inspirations for the whole AI community.

[702] Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar, Jayvart Sharma, Ryan Lagasse

Main category: cs.LG

TL;DR: HAP is a hybrid circuit discovery framework that combines attribution patching for speed and edge pruning for faithfulness, achieving 46% faster performance while maintaining circuit quality.

DetailsMotivation: Existing circuit discovery methods face a trade-off between speed (attribution patching) and faithfulness (edge pruning), limiting scalability for larger models.

Method: Hybrid attribution and pruning (HAP) framework that first uses attribution patching to identify a high-potential subgraph, then applies edge pruning to extract a faithful circuit from it.

Result: HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness, and preserves cooperative circuit components that attribution methods prune at high sparsity.

Conclusion: HAP could be an effective approach for improving the scalability of mechanistic interpretability research to larger models.

Abstract: Interpreting language models often involves circuit analysis, which aims to identify sparse subnetworks, or circuits, that accomplish specific tasks. Existing circuit discovery algorithms face a fundamental trade-off: attribution patching is fast but unfaithful to the full model, while edge pruning is faithful but computationally expensive. This research proposes a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph, then applies edge pruning to extract a faithful circuit from it. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness. Furthermore, we present a case study on the Indirect Object Identification task, showing that our method preserves cooperative circuit components (e.g. S-inhibition heads) that attribution patching methods prune at high sparsity. Our results show that HAP could be an effective approach for improving the scalability of mechanistic interpretability research to larger models. Our code is available at https://anonymous.4open.science/r/HAP-circuit-discovery.

[703] MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

Yufei Li, Yu Fu, Yue Dong, Cong Liu

Main category: cs.LG

TL;DR: MACE is a hybrid LLM system that colocates concurrent inference and fine-tuning with intelligent memory management to balance inference latency and model accuracy under constrained GPU resources on edge servers.

DetailsMotivation: The non-stationary nature of user data requires frequent retraining of LLMs, creating tension between inference latency and model accuracy under limited GPU resources. Existing retraining strategies either delay updates, over-commit resources, or overlook iteration-level granularity.

Method: MACE uses iteration-level scheduling to adapt retraining frequency to model drift without violating SLOs. It colocates concurrent inference (prefill, decode) and fine-tuning with intelligent memory management, leveraging the insight that not all model updates equally affect output alignment.

Result: Trace-driven evaluation shows MACE matches or exceeds continuous retraining while reducing inference latency by up to 63% and maintaining throughput under resource constraints. It improves latency breakdown across prefill, decode, and finetune stages, and sustains GPU utilization above 85% on NVIDIA AGX Orin.

Conclusion: Iteration-level hybrid scheduling is a promising direction for deploying LLMs with continual learning capabilities on edge platforms, effectively balancing throughput, latency, and update freshness.

Abstract: Large language models (LLMs) deployed on edge servers are increasingly used in latency-sensitive applications such as personalized assistants, recommendation, and content moderation. However, the non-stationary nature of user data necessitates frequent retraining, which introduces a fundamental tension between inference latency and model accuracy under constrained GPU resources. Existing retraining strategies either delay model updates, over-commit resources to retraining, or overlook iteration-level retraining granularity. In this paper, we identify that iteration-level scheduling is crucial for adapting retraining frequency to model drift without violating service-level objectives (SLOs). We propose MACE, a hybrid LLM system that colocates concurrent inference (prefill, decode) and fine-tuning, with intelligent memory management to maximize task performance while promising inference throughput. MACE leverages the insight that not all model updates equally affect output alignment and allocates GPU cycles accordingly to balance throughput, latency, and update freshness. Our trace-driven evaluation shows that MACE matches or exceeds continuous retraining while reducing inference latency by up to 63% and maintaining throughput under resource constraints. Compared to periodic retraining, MACE improves latency breakdown across prefill, decode, and finetune stages, and sustains GPU utilization above 85% in NVIDIA AGX Orin. These results demonstrate that iteration-level hybrid scheduling is a promising direction for deploying LLMs with continual learning capabilities on edge platforms.

[704] Edge-FIT: Federated Instruction Tuning of Quantized LLMs for Privacy-Preserving Smart Home Environments

Vinay Venkatesh, Vamsidhar R Kamanuru, Lav Kumar, Nikita Kothari

Main category: cs.LG

TL;DR: Edge-FIT is a scalable framework for federated instruction tuning of LLMs on edge devices, using 4-bit QLORA to overcome communication and computational challenges of traditional federated learning with large models.

DetailsMotivation: Traditional Federated Learning methods like FedAvg fail with LLMs due to their massive parameter size, creating communication and computational overhead that makes decentralized deployment impractical.

Method: Combines federated learning with 4-bit Quantized Low-Rank Adaptation (QLORA), filtering the Databricks Dolly 15k dataset for IoT domain, and testing on Llama 2(7B) and Phi-3-mini(3.8B) models.

Result: Edge-FIT tuned Llama 2(7B) achieves F1-Score of 0.89, and demonstrates viable trade-off with Phi-3-mini model, enabling scalable decentralized LLM deployment on home compute gateways.

Conclusion: Edge-FIT validates as a scalable framework for federated instruction tuning of LLMs on edge devices, overcoming traditional FL limitations through quantization and adaptation techniques.

Abstract: This paper proposes Edge-FIT (Federated Instruction Tuning on the Edge), a scalable framework for Federated Instruction Tuning (FIT) of Large Language Models (LLMs). Traditional Federated Learning (TFL) methods, like FedAvg, fail when confronted with the massive parameter size of LLMs [3], [6]. Our Edge-FIT framework combines federated learning with 4-bit Quantized Low-Rank Adaptation (QLORA), mitigating the core issues of communication and computational overhead. We demonstrate this by filtering the general-purpose Databricks Dolly 15k dataset for the IoT domain. Experimental results show the Edge-FIT tuned Llama 2(7B) achieves an F1-Score of 0.89. We also demonstrate a viable trade-off using the 3.8B Phi-3-mini model, validating Edge-FIT as a scalable framework for decentralized LLM deployment on home compute gateways.

[705] LogAction: Consistent Cross-system Anomaly Detection through Logs via Active Domain

Chiming Duan, Minghua He, Pei Xiao, Tong Jia, Xin Zhang, Zhewei Zhong, Xiang Luo, Yan Niu, Lingzhe Zhang, Yifan Wu, Siyu Yu, Weijie Hong, Ying Li, Gang Huang

Main category: cs.LG

TL;DR: LogAction is an active domain adaptation model for log-based anomaly detection that combines transfer learning and active learning to address labeling challenges and data distribution gaps.

DetailsMotivation: Existing log-based anomaly detection methods rely heavily on labeling, which is challenging for large log volumes. Transfer learning and active learning approaches face issues with data distribution gaps and cold-start problems.

Method: LogAction integrates transfer learning (using labeled data from mature systems) and active learning (using free energy-based and uncertainty-based sampling to select boundary logs for manual labeling).

Result: Experimental results on six dataset combinations show LogAction achieves 93.01% average F1 score with only 2% manual labels, outperforming state-of-the-art methods by 26.28%.

Conclusion: LogAction effectively addresses the labeling challenge and data distribution gap in log-based anomaly detection through active domain adaptation, achieving high performance with minimal human labeling effort.

Abstract: Log-based anomaly detection is a essential task for ensuring the reliability and performance of software systems. However, the performance of existing anomaly detection methods heavily relies on labeling, while labeling a large volume of logs is highly challenging. To address this issue, many approaches based on transfer learning and active learning have been proposed. Nevertheless, their effectiveness is hindered by issues such as the gap between source and target system data distributions and cold-start problems. In this paper, we propose LogAction, a novel log-based anomaly detection model based on active domain adaptation. LogAction integrates transfer learning and active learning techniques. On one hand, it uses labeled data from a mature system to train a base model, mitigating the cold-start issue in active learning. On the other hand, LogAction utilize free energy-based sampling and uncertainty-based sampling to select logs located at the distribution boundaries for manual labeling, thus addresses the data distribution gap in transfer learning with minimal human labeling efforts. Experimental results on six different combinations of datasets demonstrate that LogAction achieves an average 93.01% F1 score with only 2% of manual labels, outperforming some state-of-the-art methods by 26.28%. Website: https://logaction.github.io

[706] Why mask diffusion does not work

Haocheng Sun, Cynthia Xin Wen, Edward Hong Wang

Main category: cs.LG

TL;DR: Mask diffusion language models face inherent difficulties in achieving parallel generation and bidirectional attention, despite their theoretical advantages over autoregressive models.

DetailsMotivation: To address the limitations of current mask diffusion language models, particularly those based on absorbing diffusion, which struggle to achieve their theoretical advantages of parallel generation and bidirectional attention.

Method: Analysis of inherent difficulties in mask diffusion models and proposal of more effective training and inference strategies for mask diffusion.

Result: Demonstrated why mask diffusion faces inherent challenges in achieving parallel generation and bidirectional attention capabilities.

Conclusion: Proposed improved training and inference approaches to overcome the identified limitations of mask diffusion language models.

Abstract: The main advantages of diffusion language models over autoregressive (AR) models lie in their ability to support parallel generation and bidirectional attention, enabling a more controllable generation process. In recent years, open-source mask diffusion language models have emerged, most of which are based on a variant known as absorbing diffusion. However, this paper demonstrates why mask diffusion faces inherent difficulties in achieving parallel generation and bidirectional attention. We also propose the most effective training and inference strategies for mask diffusion.

[707] Single-Core Superscalar Optimization of Clifford Neural Layers

X. Angelo Huang, Ruben Ciranni, Giovanni Spadaccini, Carla J. López Zurita

Main category: cs.LG

TL;DR: This paper analyzes Clifford neural layers and proposes optimizations to speed up inference while maintaining equivariance properties, achieving a 21.35x average speedup.

DetailsMotivation: There is growing interest in developing networks with equivariance properties in physical sciences, and Clifford neural layers provide E(n) and O(n) equivariances but need performance improvements.

Method: The authors analyze the inner structure of Clifford convolutional layers, eliminate redundant matrix allocations and computations using theoretical foundations of Clifford algebras, and apply established optimization techniques.

Result: The optimized implementation achieves an average speedup of 21.35x over baseline for eleven functions, with runtimes comparable to or faster than original PyTorch implementation in six cases, and same order of magnitude performance in remaining cases.

Conclusion: The proposed optimizations successfully enhance the performance of Clifford neural layers while maintaining their equivariance properties, making them more practical for applications in physical sciences.

Abstract: Within the growing interest in the physical sciences in developing networks with equivariance properties, Clifford neural layers shine as one approach that delivers $E(n)$ and $O(n)$ equivariances given specific group actions. In this paper, we analyze the inner structure of the computation within Clifford convolutional layers and propose and implement several optimizations to speed up the inference process while maintaining correctness. In particular, we begin by analyzing the theoretical foundations of Clifford algebras to eliminate redundant matrix allocations and computations, then systematically apply established optimization techniques to enhance performance further. We report a final average speedup of 21.35x over the baseline implementation of eleven functions and runtimes comparable to and faster than the original PyTorch implementation in six cases. In the remaining cases, we achieve performance in the same order of magnitude as the original library.

[708] UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs

Yizhuo Ding, Wanying Qu, Jiawei Geng, Wenqi Shao, Yanwei Fu

Main category: cs.LG

TL;DR: UniPruning is a unified post-training pruning framework that combines local saliency metrics with global coordination using mirror descent optimization, enabling efficient sparsification of LLMs without weight updates.

DetailsMotivation: LLMs face prohibitive computational and memory costs, and existing pruning methods struggle to balance efficiency and robustness - local methods collapse under high sparsity while global methods are expensive or restrictive.

Method: Combines fast layer-wise scoring with lightweight global controller using mirror descent optimization, supporting both unstructured and semi-structured N:M pruning without updating model weights.

Result: Consistently delivers competitive or superior perplexity and zero-shot accuracy across multiple pretrained LLM families and benchmarks, with ablation studies confirming importance of mirror descent and local saliency anchoring.

Conclusion: UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs, offering one-shot pruning mask generation for arbitrary sparsity levels with hardware-aware adaptation.

Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, existing methods struggle to balance efficiency and robustness: local metric approaches prune layer by layer but often collapse under high sparsity, whereas global feedback methods enforce consistency at the cost of expensive weight updates or restrictive semi-structured formats. We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination, enabled by a mirror descent based optimization, all without updating model weights. UniPruning leverages fast layer-wise scoring and a lightweight global controller to allocate a single sparsity budget, supporting both unstructured and semi-structured N :M pruning within one framework. After a brief calibration, it can generate pruning masks for arbitrary sparsity levels in one shot, and adapts seamlessly to hardware-aware constraints. Extensive experiments on multiple pretrained LLM families and standard benchmarks show that UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy. Ablation studies further highlight the importance of mirror descent and local saliency anchoring. Overall, UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs. Our code is available at: https://github.com/RainbowQTT/UniPruning.

[709] From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing

Rana Shahout, Colin Cai, Yilun Du, Minlan Yu, Michael Mitzenmacher

Main category: cs.LG

TL;DR: LASER is a plug-and-play inference-time routing algorithm for Mixture-of-Experts models that improves load balancing without retraining, reducing latency and increasing throughput while maintaining accuracy.

DetailsMotivation: MoE models reduce training costs through conditional routing but create inference memory burdens and load imbalance issues where some experts become overloaded while others are underutilized, degrading system performance in latency, throughput, and cost.

Method: LASER adapts to the gate’s score distribution - when scores show clear preference, it routes to strongest experts; when scores are uniform, it broadens viable expert set and routes to least-loaded ones. It uses only gate scores from trained models without retraining or finetuning.

Result: LASER improves load balancing, translating into lower latency and higher throughput on Mixtral-8x7B and DeepSeek-MoE-16b-chat across four datasets (ARC-Easy, ARC-Challenge, MMLU, GSM8K), while keeping accuracy changes negligible.

Conclusion: LASER provides an effective plug-and-play solution for MoE inference load balancing that integrates directly into existing pipelines without model modifications, achieving performance improvements while preserving accuracy.

Abstract: Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts through a learned gate function. While conditional routing reduces training costs, it shifts the burden on inference memory: expert parameters and activations consume memory, limiting the number of experts per device. As tokens are routed, some experts become overloaded while others are underutilized. Because experts are mapped to GPUs, this imbalance translates directly into degraded system performance in terms of latency, throughput, and cost. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy. LASER adapts to the shape of the gate’s score distribution. When scores provide a clear preference, it routes to the strongest experts; when scores are more uniform, it broadens the set of viable experts and routes to the least-loaded among them. Because LASER relies only on gate scores from a trained model, it integrates directly into existing MoE inference pipelines without retraining or finetuning. We evaluate LASER on Mixtral-8x7B and DeepSeek-MoE-16b-chat across four datasets (ARC-Easy, ARC-Challenge, MMLU, and GSM8K). LASER improves load balancing, translating into lower latency and higher throughput, while keeping the accuracy changes negligible.

[710] CAFL-L: Constraint-Aware Federated Learning with Lagrangian Dual Optimization for On-Device Language Models

Dongqi Zheng, Wenjin Fu

Main category: cs.LG

TL;DR: CAFL-L extends FedAvg with Lagrangian dual optimization to handle device resource constraints (energy, communication, memory, thermal) by dynamically adapting training hyperparameters while maintaining training stability.

DetailsMotivation: To enable federated learning on resource-constrained edge devices by explicitly incorporating device-level resource constraints that standard FedAvg doesn't address.

Method: Uses Lagrangian dual optimization to dynamically adapt training hyperparameters (freezing depth, local steps, batch size, communication compression) with token-budget preservation via gradient accumulation.

Result: Achieves superior constraint satisfaction compared to standard FedAvg - reduces memory usage by 20% and communication by 95% while maintaining competitive validation performance.

Conclusion: CAFL-L makes federated learning practical for deployment on resource-constrained edge devices by effectively managing resource constraints without sacrificing model performance.

Abstract: We introduce Constraint-Aware Federated Learning with Lagrangian Dual Optimization (CAFL-L), a principled extension of FedAvg that explicitly incorporates device-level resource constraints including energy, communication, memory, and thermal budgets. CAFL-L employs Lagrangian dual optimization to dynamically adapt training hyperparameters – freezing depth, local steps, batch size, and communication compression – while preserving training stability through token-budget preservation via gradient accumulation. Experiments on a character-level language model demonstrate that CAFL-L achieves superior constraint satisfaction compared to standard FedAvg (reducing memory usage by 20% and communication by 95%) while maintaining competitive validation performance, making it practical for deployment on resource-constrained edge devices.

[711] Dynamic Meta-Learning for Adaptive XGBoost-Neural Ensembles

Arthur Sedek

Main category: cs.LG

TL;DR: Novel adaptive ensemble framework combining XGBoost and neural networks using meta-learning, uncertainty quantification, and feature importance for dynamic model selection.

DetailsMotivation: To develop more intelligent and flexible machine learning systems that can adaptively combine different model types for superior performance and interpretability.

Method: Synergistic combination of XGBoost and neural networks through meta-learning, incorporating advanced uncertainty quantification techniques and feature importance integration to dynamically orchestrate model selection and combination.

Result: Superior predictive performance and enhanced interpretability across diverse datasets.

Conclusion: The proposed framework contributes to the development of more intelligent and flexible machine learning systems through adaptive ensemble methods.

Abstract: This paper introduces a novel adaptive ensemble framework that synergistically combines XGBoost and neural networks through sophisticated meta-learning. The proposed method leverages advanced uncertainty quantification techniques and feature importance integration to dynamically orchestrate model selection and combination. Experimental results demonstrate superior predictive performance and enhanced interpretability across diverse datasets, contributing to the development of more intelligent and flexible machine learning systems.

[712] Revoking Amnesia: RL-based Trajectory Optimization to Resurrect Erased Concepts in Diffusion Models

Daiheng Gao, Nanxiang Jiang, Andi Zhang, Shilin Lu, Yufei Tang, Wenbo Zhou, Weiming Zhang, Zhaoxin Fan

Main category: cs.LG

TL;DR: Concept erasure in diffusion models creates only an illusion of forgetting by biasing sampling trajectories rather than genuine concept removal, making the erasure reversible. RevAm framework resurrects erased concepts through RL-based trajectory optimization without modifying model weights.

DetailsMotivation: Established concept erasure methods show degraded effectiveness in next-generation models like Flux, and the true mechanism of erasure is revealed to be trajectory manipulation rather than genuine forgetting, creating reversible safety vulnerabilities.

Method: Proposed RevAm framework uses RL-based trajectory optimization with Group Relative Policy Optimization (GRPO) adapted to diffusion models, dynamically steering the denoising process to resurrect erased concepts through trajectory-level rewards without weight modification.

Result: RevAm achieves superior concept resurrection fidelity while reducing computational time by 10x, exposing critical vulnerabilities in current safety mechanisms and demonstrating the reversibility of concept erasure.

Conclusion: Current concept erasure techniques provide only superficial safety through trajectory manipulation, highlighting the need for more robust erasure methods that achieve genuine concept removal rather than reversible trajectory biasing.

Abstract: Concept erasure techniques have been widely deployed in T2I diffusion models to prevent inappropriate content generation for safety and copyright considerations. However, as models evolve to next-generation architectures like Flux, established erasure methods (\textit{e.g.}, ESD, UCE, AC) exhibit degraded effectiveness, raising questions about their true mechanisms. Through systematic analysis, we reveal that concept erasure creates only an illusion of ``amnesia": rather than genuine forgetting, these methods bias sampling trajectories away from target concepts, making the erasure fundamentally reversible. This insight motivates the need to distinguish superficial safety from genuine concept removal. In this work, we propose \textbf{RevAm} (\underline{Rev}oking \underline{Am}nesia), an RL-based trajectory optimization framework that resurrects erased concepts by dynamically steering the denoising process without modifying model weights. By adapting Group Relative Policy Optimization (GRPO) to diffusion models, RevAm explores diverse recovery trajectories through trajectory-level rewards, overcoming local optima that limit existing methods. Extensive experiments demonstrate that RevAm achieves superior concept resurrection fidelity while reducing computational time by 10$\times$, exposing critical vulnerabilities in current safety mechanisms and underscoring the need for more robust erasure techniques beyond trajectory manipulation.

[713] Machine Learning Workflows in Climate Modeling: Design Patterns and Insights from Case Studies

Tian Zheng, Subashree Venkatasubramanian, Shuolin Li, Amy Braverman, Xinyi Ke, Zhewen Hou, Peter Jin, Samarth Sanjay Agrawal

Main category: cs.LG

TL;DR: This paper analyzes workflow design patterns in machine learning applications for climate modeling, focusing on surrogate modeling, ML parameterization, probabilistic programming, simulation-based inference, and physics-informed transfer learning.

DetailsMotivation: To address challenges in climate modeling such as physical consistency, multi-scale coupling, data sparsity, robust generalization, and integration with scientific workflows through machine learning applications.

Method: Analysis of case studies from applied machine learning research in climate modeling, with focus on synthesizing workflow design patterns rather than technical details.

Result: The paper provides a framework for ensuring rigor in scientific machine learning through transparent model development, critical evaluation, informed adaptation, and reproducibility.

Conclusion: The research aims to lower barriers for interdisciplinary collaboration between data science and climate modeling by offering structured workflow patterns and frameworks for scientific machine learning applications.

Abstract: Machine learning has been increasingly applied in climate modeling on system emulation acceleration, data-driven parameter inference, forecasting, and knowledge discovery, addressing challenges such as physical consistency, multi-scale coupling, data sparsity, robust generalization, and integration with scientific workflows. This paper analyzes a series of case studies from applied machine learning research in climate modeling, with a focus on design choices and workflow structure. Rather than reviewing technical details, we aim to synthesize workflow design patterns across diverse projects in ML-enabled climate modeling: from surrogate modeling, ML parameterization, probabilistic programming, to simulation-based inference, and physics-informed transfer learning. We unpack how these workflows are grounded in physical knowledge, informed by simulation data, and designed to integrate observations. We aim to offer a framework for ensuring rigor in scientific machine learning through more transparent model development, critical evaluation, informed adaptation, and reproducibility, and to contribute to lowering the barrier for interdisciplinary collaboration at the interface of data science and climate modeling.

[714] Thin Bridges for Drug Text Alignment: Lightweight Contrastive Learning for Target Specific Drug Retrieval

Mallikarjuna Tupakula

Main category: cs.LG

TL;DR: Thin contrastive bridges using lightweight projection heads over frozen unimodal encoders can align chemical and textual representations without full multimodal model training, achieving scaffold-aware drug-text alignment and target-specific retrieval.

DetailsMotivation: Multimodal foundation models for drug discovery typically require heavy pretraining or large multimodal corpora, which is computationally expensive. The paper investigates whether lightweight approaches can achieve similar alignment.

Method: Align ECFP4 molecular fingerprints with biomedical sentence embeddings using dual linear projections trained with contrastive objective. Incorporate hard negative weighting and margin loss to handle drugs sharing same therapeutic target. Evaluate under scaffold-based splits.

Result: Achieves non-trivial cross-modal alignment and substantially improves within-target discrimination compared to frozen baselines. Demonstrates effective scaffold-aware drug-text alignment.

Conclusion: Thin bridges offer a compute-efficient alternative to large-scale multimodal pretraining, enabling scaffold-aware drug-text alignment and target-specific retrieval in precision medicine applications.

Abstract: Multimodal foundation models hold promise for drug discovery and biomedical applications, but most existing approaches rely on heavy pretraining or large scale multimodal corpora. We investigate whether thin contrastive bridges, lightweight projection heads over frozen unimodal encoders can align chemical and textual representations without training a full multimodal model. Using paired mechanisms from ChEMBL, we align ECFP4 molecular fingerprints with biomedical sentence embeddings through dual linear projections trained with a contrastive objective. To better handle drugs sharing the same therapeutic target, we incorporate hard negative weighting and a margin loss. Evaluation under scaffold based splits, which require generalization across disjoint chemical cores, demonstrates that our approach achieves non-trivial cross modal alignment and substantially improves within target discrimination compared to frozen baselines. These results suggest that thin bridges offer a compute efficient alternative to large scale multimodal pretraining, enabling scaffold aware drug text alignment and target specific retrieval in precision medicine.

[715] Predicting Effects, Missing Distributions: Evaluating LLMs as Human Behavior Simulators in Operations Management

Runze Zhang, Xiaowei Zhang, Mingyang Zhao

Main category: cs.LG

TL;DR: LLMs can replicate human behavior in operations management experiments, reproducing most hypothesis-level effects but showing distributional differences from human data. Lightweight interventions like chain-of-thought prompting and hyperparameter tuning can reduce misalignment.

DetailsMotivation: To evaluate how well LLMs can simulate human behavior in operations management as a lower-cost alternative to traditional experimental methods like lab experiments, field studies, and surveys.

Method: Used nine published experiments in behavioral operations to assess LLMs on two criteria: replication of hypothesis-test outcomes and distributional alignment via Wasserstein distance. Tested two interventions: chain-of-thought prompting and hyperparameter tuning.

Result: LLMs reproduced most hypothesis-level effects and captured key decision biases, but their response distributions diverged from human data, even for strong commercial models. The interventions reduced misalignment and sometimes enabled smaller or open-source models to match or surpass larger systems.

Conclusion: LLMs show promise for simulating human behavior in operations management but require careful calibration to align with human response distributions, with lightweight interventions offering potential improvements.

Abstract: LLMs are emerging tools for simulating human behavior in business, economics, and social science, offering a lower-cost complement to laboratory experiments, field studies, and surveys. This paper evaluates how well LLMs replicate human behavior in operations management. Using nine published experiments in behavioral operations, we assess two criteria: replication of hypothesis-test outcomes and distributional alignment via Wasserstein distance. LLMs reproduce most hypothesis-level effects, capturing key decision biases, but their response distributions diverge from human data, including for strong commercial models. We also test two lightweight interventions – chain-of-thought prompting and hyperparameter tuning – which reduce misalignment and can sometimes let smaller or open-source models match or surpass larger systems.

[716] Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining

Anirudh Subramanyam, Yuxin Chen, Robert L. Grossman

Main category: cs.LG

TL;DR: The paper introduces a data quality parameter Q and extends scaling laws to include data quality alongside model size and data volume, showing that higher-quality data can reduce model size requirements.

DetailsMotivation: Prior scaling laws focus on model size and data volume but lack formal treatment of data quality. This work aims to establish principled scaling laws that incorporate data quality as a key dimension.

Method: Proposes a quality-aware scaling law extending Chinchilla framework, with two estimators for Q: corruption rate proxy and deficiency measure. Validates through synthetic experiments with controlled noise injection and coverage variation in neural machine translation and autoregressive modeling.

Result: Loss scales predictably with data quality; higher-quality data substantially reduces model size and compute requirements. Shows sublinear decay of effective data with quality and robustness to moderate data corruption.

Conclusion: Establishes an explicit, generalizable law for data quality that provides concrete guidance for balancing data curation effort and model scale in large-scale pretraining.

Abstract: Scaling laws for language model training traditionally characterize how performance scales with model size and dataset volume. Prior work has explored architecture variants and data treatments such as dataset filtering and noise injection in language model pretraining; however, these studies have not formalized data quality within a principled scaling law. We introduce a dimensionless data-quality parameter Q, and propose a quality-aware scaling law extending the Chinchilla framework to predict loss as a joint function of model size, data volume, and data quality. The law is motivated by an effective-sample-size and information-theoretic view of noisy or redundant corpora, and it admits two practical estimators for Q: (i) a corruption rate proxy and (ii) a deficiency measure. Through synthetic experiments in neural machine translation and autoregressive modeling – where we systematically control data quality via multiple levels of noise injection and coverage variation – we show that loss scales predictably with data quality and that higher-quality data can substantially reduce model size and hence compute requirements. Our results demonstrate a sublinear decay of effective data with quality and robustness to moderate data corruption; out-of-sample evaluations further validate the predictive form of the law. Unlike prior empirical analyses, our work establishes an explicit, generalizable law for data quality, offering concrete guidance for balancing data curation effort and model scale in large-scale pretraining.

[717] Fast frequency reconstruction using Deep Learning for event recognition in ring laser data

Giuseppe Di Somma, Giorgio Carelli, Angela D. V. Di Virgilio, Francesco Fuso, Enrico Maccioni, Paolo Marsili

Main category: cs.LG

TL;DR: Neural network approach for fast frequency reconstruction from sinusoidal signals, achieving 10ms response time and 2x better precision than Fourier methods, with automated classification of physical disturbances achieving 99-100% accuracy.

DetailsMotivation: Need for minimal delay frequency reconstruction in applications like Ring Laser Gyroscopes, where conventional methods require several seconds of data but real-time analysis is needed.

Method: Neural network approach for frequency estimation and automated classification framework to identify physical disturbances like laser instabilities and seismic events.

Result: Frequency reconstruction within ~10ms (vs seconds for conventional methods), 2x improvement in frequency estimation precision, and 99-100% accuracy for seismic event classification on test datasets.

Conclusion: This represents significant progress in integrating AI into signal analysis for geophysical applications, enabling rapid trigger generation and improved disturbance identification.

Abstract: The reconstruction of a frequency with minimal delay from a sinusoidal signal is a common task in several fields; for example Ring Laser Gyroscopes, since their output signal is a beat frequency. While conventional methods require several seconds of data, we present a neural network approach capable of reconstructing frequencies of several hundred Hertz within approximately 10 milliseconds. This enables rapid trigger generation. The method outperforms standard Fourier-based techniques, improving frequency estimation precision by a factor of 2 in the operational range of GINGERINO, our Ring Laser Gyroscope.\ In addition to fast frequency estimation, we introduce an automated classification framework to identify physical disturbances in the signal, such as laser instabilities and seismic events, achieving accuracy rates between 99% and 100% on independent test datasets for the seismic class. These results mark a step forward in integrating artificial intelligence into signal analysis for geophysical applications.

[718] Constant in an Ever-Changing World

Andy Wu, Chun-Cheng Lin, Yuehua Huang, Rung-Tzuo Liaw

Main category: cs.LG

TL;DR: CIC framework enhances RL stability by maintaining representative and current policies, selectively updating the representative policy only when current policy is superior, and using adaptive adjustment for joint critic training.

DetailsMotivation: Reinforcement learning training often suffers from severe oscillations that cause instability and degraded performance.

Method: Maintains representative and current policies, selectively updates representative policy only when current policy demonstrates superiority, and uses adaptive adjustment mechanism for joint critic training.

Result: Evaluation on five MuJoCo environments shows CIC improves performance of conventional algorithms without additional computational cost.

Conclusion: CIC framework effectively enhances algorithmic stability and improves RL performance without extra computational overhead.

Abstract: The training process of reinforcement learning often suffers from severe oscillations, leading to instability and degraded performance. In this paper, we propose a Constant in an Ever-Changing World (CIC) framework that enhances algorithmic stability to improve performance. CIC maintains both a representative policy and a current policy. Instead of updating the representative policy blindly, CIC selectively updates it only when the current policy demonstrates superiority. Furthermore, CIC employs an adaptive adjustment mechanism, enabling the representative and current policies to jointly facilitate critic training. We evaluate CIC on five MuJoCo environments, and the results show that CIC improves the performance of conventional algorithms without incurring additional computational cost.

[719] Semantic-Aware Scheduling for GPU Clusters with Large Language Models

Zerui Wang, Qinghao Hu, Ana Klimovic, Tianwei Zhang, Yonggang Wen, Peng Sun, Dahua Lin

Main category: cs.LG

TL;DR: SchedMate is a framework that enhances DL schedulers by using LLMs to extract semantic insights from source code, runtime logs, and historical job data, reducing job completion times by up to 1.91x.

DetailsMotivation: Current DL schedulers lack semantic context about jobs, relying only on limited metadata which leads to profiling overhead, unreliable duration estimation, poor failure handling, and inadequate observability.

Method: SchedMate uses three LLM-based components to systematically extract insights from unstructured data sources (source code, runtime logs, historical jobs) and integrates non-intrusively with existing DL schedulers.

Result: Evaluations on a 128-GPU physical cluster and production trace simulations show SchedMate reduces average job completion times by up to 1.91x and substantially improves scheduling performance.

Conclusion: Semantic-awareness plays a critical role in modern DL scheduling, and SchedMate effectively bridges the semantic gap in existing schedulers through LLM-based analysis of unstructured data sources.

Abstract: Deep learning (DL) schedulers are pivotal in optimizing resource allocation in GPU clusters, but operate with a critical limitation: they are largely blind to the semantic context of the jobs they manage. This forces them to rely on limited metadata, leading to high profiling overhead, unreliable duration estimation, inadequate failure handling, and poor observability. To this end, we propose SchedMate, a framework that bridges this semantic gap by systematically extracting deep insights from overlooked, unstructured data sources: source code, runtime logs, and historical jobs. SchedMate enhances existing schedulers non-intrusively through three LLM-based components. Our implementation integrates seamlessly with existing deep learning schedulers. Evaluations on a 128-GPU physical cluster and extensive simulations on production traces show SchedMate reduces average job completion times by up to 1.91x, substantially enhancing the scheduling performance, demonstrating the critical role of semantic-awareness in modern DL scheduling.

[720] Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models

Sofiane Ennadir, Levente Zólyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao

Main category: cs.LG

TL;DR: This paper provides a theoretical framework analyzing the expressivity of Transformer models with different pooling methods, showing pooling’s critical impact on model behavior across various tasks and modalities.

DetailsMotivation: While attention mechanisms in Transformers have been extensively studied, pooling operations that aggregate token representations into fixed-size vectors remain underexplored despite their significant impact on model performance.

Method: Developed a theoretical framework to characterize Transformer expressivity with different pooling methods, derived closed-form bounds on representational capacity, and empirically evaluated pooling strategies across computer vision, NLP, and time-series tasks.

Result: The analysis revealed consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior across different tasks and modalities, with theoretical bounds holding across various attention formulations.

Conclusion: Pooling should be considered a key architectural component in Transformer models, and the work provides practical guidance for selecting pooling mechanisms based on task requirements, establishing foundations for more principled model design beyond attention alone.

Abstract: Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for downstream tasks. While much of the literature has focused on attention mechanisms, the role of pooling remains underexplored despite its critical impact on model behavior. In this paper, we introduce a theoretical framework that rigorously characterizes the expressivity of Transformer-based models equipped with widely used pooling methods by deriving closed-form bounds on their representational capacity and the ability to distinguish similar inputs. Our analysis extends to different variations of attention formulations, demonstrating that these bounds hold across diverse architectural variants. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks. This work positions pooling as a key architectural component in Transformer models and lays the foundation for more principled model design beyond attention alone.

[721] Learning Pareto-Optimal Pandemic Intervention Policies with MORL

Marian Chen, Miri Zilka

Main category: cs.LG

TL;DR: A MORL framework with SDE pandemic simulator for balancing disease containment and socioeconomic stability, validated on COVID-19 data and extended to other pathogens.

DetailsMotivation: The COVID-19 pandemic highlighted the need for intervention strategies that balance disease control with socioeconomic impacts, requiring multi-objective optimization approaches.

Method: Multi-objective reinforcement learning combined with a stochastic differential equation pandemic simulator calibrated against global COVID-19 data, using Pareto-Conditioned Network agent.

Result: Achieved higher fidelity pandemic dynamics modeling than other RL approaches, demonstrated policy trade-offs between epidemiological control and economic stability, and showed framework’s adaptability to different pathogens with distinct intervention policies.

Conclusion: Provides a robust and adaptable framework for evidence-based policymaking in public health crises, capable of quantifying intervention trade-offs across different disease scenarios.

Abstract: The COVID-19 pandemic underscored a critical need for intervention strategies that balance disease containment with socioeconomic stability. We approach this challenge by designing a framework for modeling and evaluating disease-spread prevention strategies. Our framework leverages multi-objective reinforcement learning (MORL) - a formulation necessitated by competing objectives - combined with a new stochastic differential equation (SDE) pandemic simulator, calibrated and validated against global COVID-19 data. Our simulator reproduces national-scale pandemic dynamics with orders of magnitude higher fidelity than other models commonly used in reinforcement learning (RL) approaches to pandemic intervention. Training a Pareto-Conditioned Network (PCN) agent on this simulator, we illustrate the direct policy trade-offs between epidemiological control and economic stability for COVID-19. Furthermore, we demonstrate the framework’s generality by extending it to pathogens with different epidemiological profiles, such as polio and influenza, and show how these profiles lead the agent to discover fundamentally different intervention policies. To ground our work in contemporary policymaking challenges, we apply the model to measles outbreaks, quantifying how a modest 5% drop in vaccination coverage necessitates significantly more stringent and costly interventions to curb disease spread. This work provides a robust and adaptable framework to support transparent, evidence-based policymaking for mitigating public health crises.

[722] Pilot selection in the era of Virtual reality: algorithms for accurate and interpretable machine learning models

Luoma Ke, Guangpeng Zhang, Jibo He, Yajing Li, Yan Li, Xufeng Liu, Peng Fang

Main category: cs.LG

TL;DR: A machine learning approach using SVM with MIC feature selection achieves high accuracy (0.93) in distinguishing pilots from novices using VR simulation data.

DetailsMotivation: The aviation industry needs cost-efficient pilot selection methods due to rapid growth and high demand for flight crew.

Method: Used machine learning with SVM classifier and MIC feature selection on eye tracking and flight dynamics data from 23 pilots and 23 novices in VR simulations.

Result: SVM with MIC achieved highest performance: Accuracy 0.93, AUC 0.96, F1 0.93, outperforming other classifiers and feature selection methods.

Conclusion: The SVM + MIC algorithm provides superior pilot selection capability using VR simulation data and can be applied for pilot selection and training.

Abstract: With the rapid growth of the aviation industry, there is a need for a large number of flight crew. How to select the right pilots in a cost-efficient manner has become an important research question. In the current study, twenty-three pilots were recruited from China Eastern Airlines, and 23 novices were from the community of Tsinghua University. A novel approach incorporating machine learning and virtual reality technology was applied to distinguish features between these participants with different flight skills. Results indicate that SVM with the MIC feature selection method consistently achieved the highest prediction performance on all metrics with an Accuracy of 0.93, an AUC of 0.96, and an F1 of 0.93, which outperforms four other classifier algorithms and two other feature selection methods. From the perspective of feature selection methods, the MIC method can select features with a nonlinear relationship to sampling labels, instead of a simple filter-out. Our new implementation of the SVM + MIC algorithm outperforms all existing pilot selection algorithms and perhaps provides the first implementation based on eye tracking and flight dynamics data. This study’s VR simulation platforms and algorithms can be used for pilot selection and training.

[723] AgentCaster: Reasoning-Guided Tornado Forecasting

Michael Chen

Main category: cs.LG

TL;DR: AgentCaster is a contamination-free framework using multimodal LLMs for tornado forecasting, tested over 40 days with 500+ tornado reports. Models query from thousands of forecast maps and soundings but significantly underperform human experts, showing issues with hallucination, risk overprediction, and poor spatiotemporal reasoning.

DetailsMotivation: There's a need to evaluate LLMs on complex, high-impact real-world tasks to assess their readiness as reasoning agents, particularly in critical domains like weather forecasting.

Method: AgentCaster uses multimodal LLMs end-to-end for tornado forecasting, interpreting heterogeneous spatiotemporal data from high-resolution forecast archives. Models interactively query from 3,625 forecast maps and 40,125 forecast soundings over 12-36 hour horizons, with probabilistic tornado-risk polygon predictions verified against ground truths.

Result: Human experts significantly outperform state-of-the-art models. LLMs demonstrate strong tendency to hallucinate and overpredict risk intensity, struggle with precise geographic placement, and exhibit poor spatiotemporal reasoning in complex, dynamically evolving systems.

Conclusion: AgentCaster aims to advance research on improving LLM agents for challenging reasoning tasks in critical domains, highlighting current limitations in LLM performance for complex real-world forecasting tasks.

Abstract: There is a growing need to evaluate Large Language Models (LLMs) on complex, high-impact, real-world tasks to assess their true readiness as reasoning agents. To address this gap, we introduce AgentCaster, a contamination-free framework employing multimodal LLMs end-to-end for the challenging, long-horizon task of tornado forecasting. Within AgentCaster, models interpret heterogeneous spatiotemporal data from a high-resolution convection-allowing forecast archive. We assess model performance over a 40-day period featuring diverse historical data, spanning several major tornado outbreaks and including over 500 tornado reports. Each day, models query interactively from a pool of 3,625 forecast maps and 40,125 forecast soundings for a forecast horizon of 12-36 hours. Probabilistic tornado-risk polygon predictions are verified against ground truths derived from geometric comparisons across disjoint risk bands in projected coordinate space. To quantify accuracy, we propose domain-specific TornadoBench and TornadoHallucination metrics, with TornadoBench highly challenging for both LLMs and domain expert human forecasters. Notably, human experts significantly outperform state-of-the-art models, which demonstrate a strong tendency to hallucinate and overpredict risk intensity, struggle with precise geographic placement, and exhibit poor spatiotemporal reasoning in complex, dynamically evolving systems. AgentCaster aims to advance research on improving LLM agents for challenging reasoning tasks in critical domains.

[724] High Cycle S-N curve prediction for Al 7075-T6 alloy using Recurrent Neural Networks (RNNs)

Aryan Patel

Main category: cs.LG

TL;DR: A transfer learning framework using LSTM networks was developed to predict high cycle torsional S-N curves for Aluminum 7075-T6 alloy, reducing the cost and time of fatigue testing.

DetailsMotivation: Characterizing fatigue performance for materials like aluminum is extremely time-consuming and expensive, especially for high cycle data, creating a need for more efficient prediction methods.

Method: Used transfer learning with LSTM networks - trained a source model on pure axial fatigue data for Aluminum 7075-T6, then transferred it to predict high cycle torsional S-N curves.

Result: The framework accurately predicted aluminum torsional S-N curves for much higher cycle ranges than traditional testing methods.

Conclusion: This transfer learning approach can drastically reduce the cost of gathering fatigue characteristics for different materials and help prioritize tests with better cost and time constraints.

Abstract: Aluminum is a widely used alloy, which is susceptible to fatigue failure. Characterizing fatigue performance for materials is extremely time and cost demanding, especially for high cycle data. To help mitigate this, a transfer learning based framework has been developed using Long short-term memory networks (LSTMs) in which a source LSTM model is trained based on pure axial fatigue data for Aluminum 7075-T6 alloy which is then transferred to predict high cycle torsional S-N curves. The framework was able to accurately predict Al torsional S-N curves for a much higher cycle range. It is the belief that this framework will help to drastically mitigate the cost of gathering fatigue characteristics for different materials and help prioritize tests with better cost and time constraints.

[725] Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility

Annan Yu, Danielle C. Maddix, Boran Han, Xiyuan Zhang, Abdul Fatir Ansari, Oleksandr Shchur, Christos Faloutsos, Andrew Gordon Wilson, Michael W. Mahoney, Yuyang Wang

Main category: cs.LG

TL;DR: Transformers in time series exhibit low-rank structure due to decaying singular value spectra, enabling effective compression of attention layers without accuracy loss.

DetailsMotivation: Principles from text models don't transfer well to time series due to different structural properties, particularly sharp spectral decay in time-series embeddings.

Method: Analyze rank structure of time-series embeddings, prove low-rank approximations for Q/K/V projections, introduce flow-of-ranks concept, and apply compression to Chronos model.

Result: Successfully compressed Chronos time series foundation model with 65% reduction in inference time and 81% reduction in memory, maintaining accuracy.

Conclusion: Time series Transformers have inherent compressibility due to low-rank structure, providing principled guidance for model architecture design and compression.

Abstract: Transformers are widely used across data modalities, and yet the principles distilled from text models often transfer imperfectly to models trained to other modalities. In this paper, we analyze Transformers through the lens of rank structure. Our focus is on the time series setting, where the structural properties of the data differ remarkably from those of text or vision. We show that time-series embeddings, unlike text or vision, exhibit sharply decaying singular value spectra: small patch sizes and smooth continuous mappings concentrate the data into low-rank subspaces. From this, we prove that the associated $Q/K/V$ projections admit accurate low-rank approximations, and that attention layers become compressible in proportion to the decay of the embedding spectrum. We introduce the concept of flow-of-ranks, a phenomenon by which nonlinear mixing across depth inflates the rank, explaining why early layers are most amenable to compression and why ranks grow with depth. Guided by these theoretical and empirical results, we use these insights to compress Chronos, a large time series foundation model, achieving a reduction of $65%$ in inference time and $81%$ in memory, without loss of accuracy. Our findings provide principled guidance for allocating width, depth, and heads in time series foundation models, and for exploiting their inherent compressibility.

[726] Physics-informed Neural-operator Predictive Control for Drag Reduction in Turbulent Flows

Zelin Zhao, Zongyi Li, Kimia Hassibi, Kamyar Azizzadenesheli, Junchi Yan, H. Jane Bae, Di Zhou, Anima Anandkumar

Main category: cs.LG

TL;DR: Proposes PINO-PC, a model-based deep reinforcement learning framework using Physics Informed Neural Operators for efficient turbulence control and drag reduction in high Reynolds number flows.

DetailsMotivation: Numerical assessment of turbulence control effects for wall friction is computationally expensive, requiring costly simulations of turbulent fluid dynamics.

Method: Model-based reinforcement learning with predictive control, where both policy and observer models are learned jointly using Physics Informed Neural Operators (PINO) that are discretization invariant and capture fine turbulent scales.

Result: PINO-PC achieves 39.0% drag reduction at Reynolds number 15,000, outperforming previous fluid control methods by more than 32% and showing better performance than model-free RL methods in high Reynolds number and unseen flow scenarios.

Conclusion: The proposed PINO-PC framework provides an efficient and effective approach for turbulence control, demonstrating superior performance in challenging high Reynolds number flow conditions compared to existing methods.

Abstract: Assessing turbulence control effects for wall friction numerically is a significant challenge since it requires expensive simulations of turbulent fluid dynamics. We instead propose an efficient deep reinforcement learning (RL) framework for modeling and control of turbulent flows. It is model-based RL for predictive control (PC), where both the policy and the observer models for turbulence control are learned jointly using Physics Informed Neural Operators (PINO), which are discretization invariant and can capture fine scales in turbulent flows accurately. Our PINO-PC outperforms prior model-free reinforcement learning methods in various challenging scenarios where the flows are of high Reynolds numbers and unseen, i.e., not provided during model training. We find that PINO-PC achieves a drag reduction of 39.0% under a bulk-velocity Reynolds number of 15,000, outperforming previous fluid control methods by more than 32%.

Lijiao Wang, Muhammad Usama, Haris N. Koutsopoulos, Zhengbing He

Main category: cs.LG

TL;DR: A data-driven framework using open-source GPS data, road networks, and satellite imagery to estimate vehicle operating modes and emissions, achieving over 50% RMSE reduction compared to MOVES baseline.

DetailsMotivation: To create a scalable and transparent method for estimating vehicle emissions using readily available open-source data instead of proprietary or expensive data sources.

Method: Integration of MOVES with open-source GPS trajectories, OpenStreetMap road networks, regional traffic data, and satellite imagery features. A neural network model predicts MOVES-defined operating mode distributions using derived features from available data.

Result: Applied to 45 municipalities in Boston Metropolitan area, the model reduced RMSE by over 50% for regional traffic emissions of CO, NOx, CO2, and PM2.5 compared to MOVES baseline.

Conclusion: Demonstrates feasibility of low-cost, replicable, and data-driven emissions estimation using fully open data sources, offering a scalable alternative to traditional methods.

Abstract: Open-source data offers a scalable and transparent foundation for estimating vehicle activity and emissions in urban regions. In this study, we propose a data-driven framework that integrates MOVES and open-source GPS trajectory data, OpenStreetMap (OSM) road networks, regional traffic datasets and satellite imagery-derived feature vectors to estimate the link level operating mode distribution and traffic emissions. A neural network model is trained to predict the distribution of MOVES-defined operating modes using only features derived from readily available data. The proposed methodology was applied using open-source data related to 45 municipalities in the Boston Metropolitan area. The “ground truth” operating mode distribution was established using OSM open-source GPS trajectories. Compared to the MOVES baseline, the proposed model reduces RMSE by over 50% for regional scale traffic emissions of key pollutants including CO, NOx, CO2, and PM2.5. This study demonstrates the feasibility of low-cost, replicable, and data-driven emissions estimation using fully open data sources.

[728] Diffusion-Based, Data-Assimilation-Enabled Super-Resolution of Hub-height Winds

Xiaolong Ma, Xu Dong, Ashley Tarrant, Lei Yang, Rao Kotamarthi, Jiali Wang, Feng Yan, Rajkumar Kettimuthu

Main category: cs.LG

TL;DR: WindSR is a diffusion model with data assimilation for super-resolution downscaling of hub-height winds, integrating sparse observations with simulations to generate high-quality wind speed data at infrastructure scales.

DetailsMotivation: High-quality hub-height wind observations are sparse, while simulations are biased and too coarse for wind-farm siting and extreme weather risk assessment at infrastructure scales.

Method: Uses diffusion models with dynamic-radius blending to merge observations with simulations, incorporating terrain information during training and inference.

Result: Outperforms CNN and GAN baselines in downscaling efficiency and accuracy, with data assimilation reducing model bias by ~20% relative to independent observations.

Conclusion: WindSR successfully integrates observations and simulations to produce high-resolution hub-height wind data, improving accuracy for wind energy applications.

Abstract: High-quality observations of hub-height winds are valuable but sparse in space and time. Simulations are widely available on regular grids but are generally biased and too coarse to inform wind-farm siting or to assess extreme-weather-related risks (e.g., gusts) at infrastructure scales. To fully utilize both data types for generating high-quality, high-resolution hub-height wind speeds (tens to ~100m above ground), this study introduces WindSR, a diffusion model with data assimilation for super-resolution downscaling of hub-height winds. WindSR integrates sparse observational data with simulation fields during downscaling using state-of-the-art diffusion models. A dynamic-radius blending method is introduced to merge observations with simulations, providing conditioning for the diffusion process. Terrain information is incorporated during both training and inference to account for its role as a key driver of winds. Evaluated against convolutional-neural-network and generative-adversarial-network baselines, WindSR outperforms them in both downscaling efficiency and accuracy. Our data assimilation reduces WindSR’s model bias by approximately 20% relative to independent observations.

[729] Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis

Harshwardhan Fartale, Ashish Kattamuri, Rahul Raja, Arpita Vats, Ishita Prasad, Akshata Kishore Moharir

Main category: cs.LG

TL;DR: Transformer models use distinct neural circuits for recall (fact retrieval) and reasoning (multi-step inference), with separable but interacting circuits identified through causal interventions.

DetailsMotivation: To determine if recall and reasoning abilities in transformer models rely on distinct internal mechanisms, which is crucial for predicting generalization, designing evaluations, and building safer interventions.

Method: Used mechanistic interpretability with controlled synthetic linguistic puzzles, combining activation patching and structured ablations to measure component contributions at layer, head, and neuron levels across Qwen and LLaMA model families.

Result: Found that disabling identified “recall circuits” reduced fact-retrieval accuracy by up to 15% while leaving reasoning intact, and disabling “reasoning circuits” reduced multi-step inference by comparable margins. Task-specific neuron firing patterns were observed but less robust.

Conclusion: Provides first causal evidence that recall and reasoning rely on separable but interacting circuits in transformers, advancing mechanistic interpretability and informing safer deployment of large language models.

Abstract: Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multi-step inference), but whether these abilities rely on distinct internal mechanisms remains unclear. Distinguishing recall from reasoning is crucial for predicting model generalization, designing targeted evaluations, and building safer interventions that affect one ability without disrupting the other.We approach this question through mechanistic interpretability, using controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our pipeline combines activation patching and structured ablations to causally measure component contributions to each task type. Across two model families (Qwen and LLaMA), we find that interventions on distinct layers and attention heads lead to selective impairments: disabling identified “recall circuits” reduces fact-retrieval accuracy by up to 15% while leaving reasoning intact, whereas disabling “reasoning circuits” reduces multi-step inference by a comparable margin. At the neuron level, we observe task-specific firing patterns, though these effects are less robust, consistent with neuronal polysemanticity.Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models. These findings advance mechanistic interpretability by linking circuit-level structure to functional specialization and demonstrate how controlled datasets and causal interventions can yield mechanistic insights into model cognition, informing safer deployment of large language models.

[730] Distributed Low-Communication Training with Decoupled Momentum Optimization

Sasho Nedelkoski, Alexander Acker, Odej Kao, Soeren Becker, Dominik Scheinert

Main category: cs.LG

TL;DR: Proposes a communication-efficient distributed training method that combines infrequent synchronization with gradient momentum compression using DCT to reduce bandwidth requirements.

DetailsMotivation: Reduce reliance on high-bandwidth interconnects for large model training, enabling use of distributed compute resources instead of centralized data centers.

Method: Combines infrequent synchronizations across distributed model replicas with gradient momentum compression using discrete cosine transform (DCT) to decompose Nesterov momentum into high- and low-frequency components, synchronizing only high-frequency components every H steps.

Result: Achieves up to 16× reduction in communication compared to baseline DiLoCo, and generalizes across transformer-based language models and convolutional neural networks.

Conclusion: Advances feasibility of training large models on distributed nodes with low-bandwidth interconnects.

Abstract: The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.

[731] Conditional Pseudo-Supervised Contrast for Data-Free Knowledge Distillation

Renrong Shao, Wei Zhang, Jun wang

Main category: cs.LG

TL;DR: CPSC-DFKD is a novel data-free knowledge distillation method that uses conditional GANs to generate category-specific diverse images for pseudo-supervised learning, with improved generator modules and contrastive learning to enhance performance.

DetailsMotivation: Current DFKD methods lack pseudo-supervised paradigms, cannot distinguish different category distributions (producing ambiguous samples), and cannot optimize category-wise diversity, which hinders student model learning.

Method: Uses conditional GAN to synthesize category-specific diverse images, improves generator modules to distinguish category distributions, and employs pseudo-supervised contrastive learning based on teacher-student views.

Result: Comprehensive experiments on three datasets validate performance improvements for both student model and generator.

Conclusion: CPSC-DFKD effectively addresses limitations of current DFKD methods and achieves better performance through conditional generation and contrastive learning.

Abstract: Data-free knowledge distillation~(DFKD) is an effective manner to solve model compression and transmission restrictions while retaining privacy protection, which has attracted extensive attention in recent years. Currently, the majority of existing methods utilize a generator to synthesize images to support the distillation. Although the current methods have achieved great success, there are still many issues to be explored. Firstly, the outstanding performance of supervised learning in deep learning drives us to explore a pseudo-supervised paradigm on DFKD. Secondly, current synthesized methods cannot distinguish the distributions of different categories of samples, thus producing ambiguous samples that may lead to an incorrect evaluation by the teacher. Besides, current methods cannot optimize the category-wise diversity samples, which will hinder the student model learning from diverse samples and further achieving better performance. In this paper, to address the above limitations, we propose a novel learning paradigm, i.e., conditional pseudo-supervised contrast for data-free knowledge distillation~(CPSC-DFKD). The primary innovations of CPSC-DFKD are: (1) introducing a conditional generative adversarial network to synthesize category-specific diverse images for pseudo-supervised learning, (2) improving the modules of the generator to distinguish the distributions of different categories, and (3) proposing pseudo-supervised contrastive learning based on teacher and student views to enhance diversity. Comprehensive experiments on three commonly-used datasets validate the performance lift of both the student and generator brought by CPSC-DFKD. The code is available at https://github.com/RoryShao/CPSC-DFKD.git

[732] A Robust Clustered Federated Learning Approach for Non-IID Data with Quantity Skew

Michael Ben Ali, Imen Megdiche, André Peninou, Olivier Teste

Main category: cs.LG

TL;DR: This paper evaluates CFL algorithms under Non-IID data with Quantity Skew and proposes CORNFLQS, a novel iterative CFL algorithm that coordinates both client selection and server grouping strategies, achieving superior accuracy and robustness.

DetailsMotivation: Address the challenge of Quantity Skew in Federated Learning, where most CFL methods lack systematic evaluation under heterogeneous data volumes despite its significant impact on model performance.

Method: Proposed CORNFLQS algorithm that iteratively coordinates between client selection (minimizing local loss) and server grouping (based on model similarities) strategies. Conducted extensive experiments on 6 image datasets with 270 Non-IID configurations.

Result: CORNFLQS achieved highest average ranking in both accuracy and clustering quality, demonstrating strong robustness to Quantity Skew perturbations and outperforming existing CFL algorithms.

Conclusion: The proposed CORNFLQS algorithm effectively addresses Quantity Skew in Federated Learning by optimally coordinating both CFL operating strategies, providing a robust solution that outperforms current state-of-the-art methods.

Abstract: Federated Learning (FL) is a decentralized paradigm that enables a client-server architecture to collaboratively train a global Artificial Intelligence model without sharing raw data, thereby preserving privacy. A key challenge in FL is Non-IID data. Quantity Skew (QS) is a particular problem of Non-IID, where clients hold highly heterogeneous data volumes. Clustered Federated Learning (CFL) is an emergent variant of FL that presents a promising solution to Non-IID problem. It improves models’ performance by grouping clients with similar data distributions into clusters. CFL methods generally fall into two operating strategies. In the first strategy, clients select the cluster that minimizes the local training loss. In the second strategy, the server groups clients based on local model similarities. However, most CFL methods lack systematic evaluation under QS but present significant challenges because of it. In this paper, we present two main contributions. The first one is an evaluation of state-of-the-art CFL algorithms under various Non-IID settings, applying multiple QS scenarios to assess their robustness. Our second contribution is a novel iterative CFL algorithm, named CORNFLQS, which proposes an optimal coordination between both operating strategies of CFL. Our approach is robust against the different variations of QS settings. We conducted intensive experiments on six image classification datasets, resulting in 270 Non-IID configurations. The results show that CORNFLQS achieves the highest average ranking in both accuracy and clustering quality, as well as strong robustness to QS perturbations. Overall, our approach outperforms actual CFL algorithms.

[733] Cross-Modal Reconstruction Pretraining for Ramp Flow Prediction at Highway Interchanges

Yongchao Li, Jun Chen, Zhuoxuan Li, Chao Gao, Yang Li, Chu Zhang, Changyin Dong

Main category: cs.LG

TL;DR: STDAE is a two-stage framework that reconstructs historical ramp flows from mainline data using cross-modal pretraining, then integrates learned representations with forecasting models to improve traffic prediction at interchanges without real-time ramp detectors.

DetailsMotivation: Interchanges lack real-time ramp detectors, creating blind spots in traffic prediction that need to be addressed.

Method: Propose Spatio-Temporal Decoupled Autoencoder (STDAE) with parallel spatial and temporal autoencoders for cross-modal reconstruction pretraining, then integrate learned representations with models like GWNet.

Result: STDAE-GWNet consistently outperforms 13 state-of-the-art baselines across three real-world interchange datasets and achieves performance comparable to models using historical ramp data.

Conclusion: STDAE effectively overcomes detector scarcity and demonstrates plug-and-play potential for diverse forecasting pipelines.

Abstract: Interchanges are crucial nodes for vehicle transfers between highways, yet the lack of real-time ramp detectors creates blind spots in traffic prediction. To address this, we propose a Spatio-Temporal Decoupled Autoencoder (STDAE), a two-stage framework that leverages cross-modal reconstruction pretraining. In the first stage, STDAE reconstructs historical ramp flows from mainline data, forcing the model to capture intrinsic spatio-temporal relations. Its decoupled architecture with parallel spatial and temporal autoencoders efficiently extracts heterogeneous features. In the prediction stage, the learned representations are integrated with models such as GWNet to enhance accuracy. Experiments on three real-world interchange datasets show that STDAE-GWNET consistently outperforms thirteen state-of-the-art baselines and achieves performance comparable to models using historical ramp data. This demonstrates its effectiveness in overcoming detector scarcity and its plug-and-play potential for diverse forecasting pipelines.

[734] Studying the Korean Word-Chain Game with RLVR:Mitigating Reward Conflicts via Curriculum Learning

Donghwan Rho

Main category: cs.LG

TL;DR: RLVR applied to Korean word-chain game shows curriculum learning mitigates conflicting rule-derived rewards.

DetailsMotivation: Study RLVR for LLMs on Korean word-chain game to address conflicting rewards in diverse language puzzles.

Method: Apply RLVR with curriculum learning to handle conflicting rule-derived rewards in the game.

Result: Curriculum learning effectively mitigates reward conflicts in the Korean word-chain game.

Conclusion: Puzzle tasks in diverse languages warrant further study with RLVR and curriculum learning.

Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training large language models (LLMs) with stronger reasoning abilities. It has also been applied to a variety of logic puzzles. In this work, we study the Korean word-chain game using RLVR. We show that rule-derived rewards can naturally conflict, and demonstrate through experiments that a curriculum-learning scheme mitigates these conflicts. Our findings motivate further studies of puzzle tasks in diverse languages.

[735] Training Variation of Physically-Informed Deep Learning Models

Ashley Lenau, Dennis Dimiduk, Stephen R. Niezgoda

Main category: cs.LG

TL;DR: This paper analyzes the reliability and reproducibility of training algorithms in deep learning, particularly focusing on physics-informed loss functions for enforcing boundary conditions in stress field prediction.

DetailsMotivation: There is insufficient discussion on the reliability and reproducibility of training algorithms, especially with the rise of physics-informed loss functions. The paper aims to assess how reliably loss functions can condition networks to enforce boundary conditions.

Method: Used a Pix2Pix network predicting stress fields of high elastic contrast composites as a case study. Implemented several different loss functions enforcing stress equilibrium and analyzed their performance across multiple training sessions.

Result: Different loss functions showed varying levels of variation in convergence, accuracy, and enforcement of stress equilibrium across training sessions, highlighting the importance of reporting model variation.

Conclusion: Reporting model variation is essential for assessing loss function reliability and provides fairer comparison among methods. The paper shares suggested practices for reporting model variation.

Abstract: A successful deep learning network is highly dependent not only on the training dataset, but the training algorithm used to condition the network for a given task. The loss function, dataset, and tuning of hyperparameters all play an essential role in training a network, yet there is not much discussion on the reliability or reproducibility of a training algorithm. With the rise in popularity of physics-informed loss functions, this raises the question of how reliable one’s loss function is in conditioning a network to enforce a particular boundary condition. Reporting the model variation is needed to assess a loss function’s ability to consistently train a network to obey a given boundary condition, and provides a fairer comparison among different methods. In this work, a Pix2Pix network predicting the stress fields of high elastic contrast composites is used as a case study. Several different loss functions enforcing stress equilibrium are implemented, with each displaying different levels of variation in convergence, accuracy, and enforcing stress equilibrium across many training sessions. Suggested practices in reporting model variation are also shared.

[736] Multi-task neural diffusion processes for uncertainty-quantified wind power prediction

Joseph Rawson, Domniki Ladopoulou, Petros Dellaportas

Main category: cs.LG

TL;DR: The paper proposes a multi-task neural diffusion process (MT-NDP) framework for uncertainty-aware wind power prediction, extending NDPs to capture cross-turbine correlations and enable few-shot adaptation to unseen turbines.

DetailsMotivation: Uncertainty-aware wind power prediction is essential for grid integration and reliable wind farm operation, requiring models that can provide calibrated predictions and adapt to different turbine behaviors.

Method: Extends neural diffusion processes (NDPs) to a multi-task framework (MT-NDP) with a task encoder to capture cross-turbine correlations, enabling few-shot adaptation to unseen turbines. First empirical evaluation of NDPs on real SCADA data.

Result: MT-NDP outperforms single-task NDPs and Gaussian processes in point accuracy and calibration, particularly for turbines deviating from fleet average. Provides sharper yet trustworthy predictive intervals suitable for operational deployment.

Conclusion: NDP-based models deliver calibrated and scalable predictions for wind power forecasting, offering reliable uncertainty estimates that can support dispatch and maintenance decisions in modern wind farms.

Abstract: Uncertainty-aware wind power prediction is essential for grid integration and reliable wind farm operation. We apply neural diffusion processes (NDPs)-a recent class of models that learn distributions over functions-and extend them to a multi-task NDP (MT-NDP) framework for wind power prediction. We provide the first empirical evaluation of NDPs in real supervisory control and data acquisition (SCADA) data. We introduce a task encoder within MT-NDPs to capture cross-turbine correlations and enable few-shot adaptation to unseen turbines. The proposed MT-NDP framework outperforms single-task NDPs and GPs in terms of point accuracy and calibration, particularly for wind turbines whose behaviour deviates from the fleet average. In general, NDP-based models deliver calibrated and scalable predictions suitable for operational deployment, offering sharper, yet trustworthy, predictive intervals that can support dispatch and maintenance decisions in modern wind farms.

[737] Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices

Congzheng Song, Xinyu Tang

Main category: cs.LG

TL;DR: MeBP is a memory-efficient backpropagation implementation for fine-tuning LLMs on mobile devices, enabling models up to 4B parameters to be trained with less than 1GB memory while maintaining better performance than zeroth-order optimization.

DetailsMotivation: Current fine-tuning methods for LLMs are impractical on mobile devices - backpropagation is too memory-intensive, while zeroth-order optimization has slow convergence (10-100x more steps).

Method: Proposed MeBP (memory-efficient backpropagation) that optimizes memory usage during fine-tuning on mobile devices, providing better trade-off between memory and compute time.

Result: Successfully fine-tuned various LLMs (0.5B to 4B parameters) on iPhone 15 Pro Max using less than 1GB memory, with faster convergence and better performance than ZO baseline.

Conclusion: MeBP enables practical fine-tuning of large language models on resource-constrained mobile devices, bridging the gap between memory efficiency and training performance.

Abstract: Fine-tuning large language models (LLMs) with backpropagation\textemdash even for a subset of parameters such as LoRA\textemdash can be much more memory-consuming than inference and is often deemed impractical for resource-constrained mobile devices. Alternative methods, such as zeroth-order optimization (ZO), can greatly reduce the memory footprint but come at the cost of significantly slower model convergence (10$\times$ to 100$\times$ more steps than backpropagation). We propose a memory-efficient implementation of backpropagation (MeBP) on mobile devices that provides better trade-off between memory usage and compute time, while converging faster and achieving better performance than the ZO baseline. We verify the effectiveness of MeBP on an iPhone 15 Pro Max and show that various LLMs, ranging from 0.5B to 4B parameters, can be fine-tuned using less than 1GB of memory. We release an example of the MeBP implementation at https://github.com/apple/ml-mebp.

[738] Generalized Orders of Magnitude for Scalable, Parallel, High-Dynamic-Range Computation

Franz A. Heinsen, Leo Kozachkov

Main category: cs.LG

TL;DR: GOOMs extend traditional orders of magnitude to enable stable computation over larger dynamic ranges than floating-point numbers, with efficient parallel implementation supporting applications like matrix products, Lyapunov exponents, and deep RNNs.

DetailsMotivation: Many domains require compounding real numbers over long sequences, leading to catastrophic numerical underflow or overflow that limits practical computation.

Method: Introduce generalized orders of magnitude (GOOMs) as principled extension of traditional orders of magnitude, implement with efficient custom parallel prefix scan for GPU execution.

Result: GOOMs outperform traditional approaches, enabling previously impractical computations: compounding matrix products beyond floating-point limits, faster Lyapunov exponent estimation with selective-resetting, and long-range dependencies in deep RNNs without stabilization.

Conclusion: GOOMs combined with efficient parallel scanning offer scalable and numerically robust alternative to conventional floating-point numbers for high-dynamic-range applications.

Abstract: Many domains, from deep learning to finance, require compounding real numbers over long sequences, often leading to catastrophic numerical underflow or overflow. We introduce generalized orders of magnitude (GOOMs), a principled extension of traditional orders of magnitude that incorporates floating-point numbers as a special case, and which in practice enables stable computation over significantly larger dynamic ranges of real numbers than previously possible. We implement GOOMs, along with an efficient custom parallel prefix scan, to support native execution on parallel hardware such as GPUs. We demonstrate that our implementation of GOOMs outperforms traditional approaches with three representative experiments, all of which were previously considered impractical or impossible, and now become possible and practical: (1) compounding real matrix products far beyond standard floating-point limits; (2) estimating spectra of Lyapunov exponents in parallel, orders of magnitude faster than with previous methods, applying a novel selective-resetting method to prevent state colinearity; and (3) capturing long-range dependencies in deep recurrent neural networks with non-diagonal recurrent states, computed in parallel via a prefix scan, without requiring any form of stabilization. Our results show that our implementation of GOOMs, combined with efficient parallel scanning, offers a scalable and numerically robust alternative to conventional floating-point numbers for high-dynamic-range applications.

[739] LHGEL: Large Heterogeneous Graph Ensemble Learning using Batch View Aggregation

Jiajun Shen, Yufei Jin, Yi He, Xingquan Zhu

Main category: cs.LG

TL;DR: LHGEL is an ensemble learning framework for large heterogeneous graphs that uses batch sampling with three key components: batch view aggregation, residual attention, and diversity regularization to capture graph heterogeneity while maintaining computational efficiency.

DetailsMotivation: Large heterogeneous graphs present challenges due to scale, node/edge type heterogeneity, feature variations, and complex local structures. Ensemble learning is proposed as a natural solution to capture different aspects of graph heterogeneity.

Method: LHGEL framework with three components: 1) Batch view aggregation samples subgraphs to form multiple graph views, 2) Residual attention adaptively weights view contributions to guide node embeddings, 3) Diversity regularization encourages representational disparity across embedding matrices from different views.

Result: Theoretical analysis shows residual attention mitigates gradient vanishing issues. Empirical results on five real heterogeneous networks demonstrate LHGEL consistently outperforms state-of-the-art competitors by substantial margins.

Conclusion: LHGEL effectively addresses challenges in learning from large heterogeneous graphs through ensemble learning with specialized components for view aggregation, attention weighting, and diversity promotion, achieving superior performance over existing methods.

Abstract: Learning from large heterogeneous graphs presents significant challenges due to the scale of networks, heterogeneity in node and edge types, variations in nodal features, and complex local neighborhood structures. This paper advocates for ensemble learning as a natural solution to this problem, whereby training multiple graph learners under distinct sampling conditions, the ensemble inherently captures different aspects of graph heterogeneity. Yet, the crux lies in combining these learners to meet global optimization objective while maintaining computational efficiency on large-scale graphs. In response, we propose LHGEL, an ensemble framework that addresses these challenges through batch sampling with three key components, namely batch view aggregation, residual attention, and diversity regularization. Specifically, batch view aggregation samples subgraphs and forms multiple graph views, while residual attention adaptively weights the contributions of these views to guide node embeddings toward informative subgraphs, thereby improving the accuracy of base learners. Diversity regularization encourages representational disparity across embedding matrices derived from different views, promoting model diversity and ensemble robustness. Our theoretical study demonstrates that residual attention mitigates gradient vanishing issues commonly faced in ensemble learning. Empirical results on five real heterogeneous networks validate that our LHGEL approach consistently outperforms its state-of-the-art competitors by substantial margin. Codes and datasets are available at https://github.com/Chrisshen12/LHGEL.

[740] Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation

Jairo Diaz-Rodriguez, Mumin Jia

Main category: cs.LG

TL;DR: Kernel change-point detection (KCPD) is proven consistent for m-dependent data like text, validated through LLM-based simulations and comprehensive empirical studies showing superior text segmentation performance with modern embeddings.

DetailsMotivation: Real-world sequential data such as text exhibits strong dependencies, but existing KCPD theory only establishes consistency under independence assumptions, creating a gap between theory and practice.

Method: Prove theoretical consistency guarantees for KCPD under m-dependent data, perform LLM-based simulation with synthetic m-dependent text, and conduct comprehensive empirical study of KCPD for text segmentation using modern embeddings across diverse datasets.

Result: KCPD achieves consistency in number of detected change points and weak consistency in their locations under m-dependence. Empirical results show KCPD with text embeddings outperforms baselines in standard text segmentation metrics across diverse datasets.

Conclusion: KCPD provides strong theoretical reliability under realistic dependency assumptions and practical effectiveness for text segmentation tasks, as demonstrated through both simulations and real-world case studies.

Abstract: Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. While existing theory establishes consistency under independence assumptions, real-world sequential data such as text exhibits strong dependencies. We establish new guarantees for KCPD under $m$-dependent data: specifically, we prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. We perform an LLM-based simulation that generates synthetic $m$-dependent text to validate the asymptotics. To complement these results, we present the first comprehensive empirical study of KCPD for text segmentation with modern embeddings. Across diverse text datasets, KCPD with text embeddings outperforms baselines in standard text segmentation metrics. We demonstrate through a case study on Taylor Swift’s tweets that KCPD not only provides strong theoretical and simulated reliability but also practical effectiveness for text segmentation tasks.

[741] On residual network depth

Benoit Dherin, Michael Munn

Main category: cs.LG

TL;DR: Deep residual networks behave like ensembles of shallower models, with depth expansion mathematically equivalent to ensemble size growth, explaining the need for normalization layers.

DetailsMotivation: To formally understand why depth is effective in residual architectures and explain the historical necessity of normalization layers in deep models.

Method: Developed an explicit analytical formula (Residual Expansion Theorem) that proves depth expansion equals ensemble size growth, revealing hierarchical ensemble structure and combinatorial path explosion.

Result: Showed that scaling residual modules provides principled solution to control combinatorial explosion, acting as capacity control and implicit regularization.

Conclusion: Residual networks’ ensemble behavior explains normalization layer necessity, and scaling offers principled alternative to previous heuristic approaches like SkipInit and Fixup.

Abstract: Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal, explaining the historical necessity of normalization layers in training deep models. This insight offers a first principles explanation for the historical dependence on normalization layers and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network’s inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity controls that also implicitly regularizes the model’s complexity.

[742] How to Set $β_1, β_2$ in Adam: An Online Learning Perspective

Quan Nguyen

Main category: cs.LG

TL;DR: This paper provides novel theoretical analyses for Adam optimizer’s momentum parameters β₁ and β₂, covering both β₁ ≥ √β₂ and β₁ ≤ √β₂ cases, with tight worst-case bounds and insights on optimal parameter settings.

DetailsMotivation: While Adam is widely used in practice, theoretical understanding of how to optimally set its momentum factors β₁ and β₂ remains incomplete, especially for practical cases where β₁ ≠ √β₂.

Method: The authors derive novel analyses by viewing Adam as an instance of Follow-the-Regularized-Leader (FTRL) and provide general analyses that work for both β₁ ≥ √β₂ and β₁ ≤ √β₂ cases.

Result: The new analyses strictly generalize existing bounds, are tight in the worst case, and show that β₁ = √β₂ is optimal for oblivious adversaries but sub-optimal for non-oblivious adversaries.

Conclusion: This work provides comprehensive theoretical understanding of Adam’s momentum parameters, offering practical guidance for parameter tuning in different adversarial settings.

Abstract: While Adam is one of the most effective optimizer for training large-scale machine learning models, a theoretical understanding of how to optimally set its momentum factors, $\beta_1$ and $\beta_2$, remains largely incomplete. Prior works have shown that Adam can be seen as an instance of Follow-the-Regularized-Leader (FTRL), one of the most important class of algorithms in online learning. The prior analyses in these works required setting $\beta_1 = \sqrt{\beta_2}$, which does not cover the more practical cases with $\beta_1 \neq \sqrt{\beta_2}$. We derive novel, more general analyses that hold for both $\beta_1 \geq \sqrt{\beta_2}$ and $\beta_1 \leq \sqrt{\beta_2}$. In both cases, our results strictly generalize the existing bounds. Furthermore, we show that our bounds are tight in the worst case. We also prove that setting $\beta_1 = \sqrt{\beta_2}$ is optimal for an oblivious adversary, but sub-optimal for an non-oblivious adversary.

[743] Reasoning-based Anomaly Detection Framework: A Real-time, Scalable, and Automated Approach to Anomaly Detection Across Domains

Anupam Panwar, Himadri Pal, Jiali Chen, Kyle Cho, Riddick Jiang, Miao Zhao, Rajiv Krishnamurthy

Main category: cs.LG

TL;DR: RADF is a unified framework for real-time anomaly detection in large distributed systems that addresses data volume, dataset heterogeneity, and root-cause analysis challenges through automated algorithm selection and post-detection capabilities.

DetailsMotivation: Address three key challenges in anomaly detection: handling large data volumes in high-throughput environments, managing heterogeneous time-series datasets across multiple domains, and determining root-causes of anomalies in real-time.

Method: Uses Reasoning based Anomaly Detection Framework (RADF) with mSelect technique for automated algorithm selection and hyper-parameter tuning, plus post-detection capabilities for faster triaging and root-cause determination.

Result: Outperformed state-of-the-art models in AUC performance for 5 out of 9 public datasets, achieving AUC over 0.85 for 7 out of 9 datasets - unmatched by any other model.

Conclusion: RADF provides an effective unified solution for real-time anomaly detection in large-scale distributed systems, successfully addressing the identified challenges through automation and enhanced detection capabilities.

Abstract: Detecting anomalies in large, distributed systems presents several challenges. The first challenge arises from the sheer volume of data that needs to be processed. Flagging anomalies in a high-throughput environment calls for a careful consideration of both algorithm and system design. The second challenge comes from the heterogeneity of time-series datasets that leverage such a system in production. In practice, anomaly detection systems are rarely deployed for a single use case. Typically, there are several metrics to monitor, often across several domains (e.g. engineering, business and operations). A one-size-fits-all approach rarely works, so these systems need to be fine-tuned for every application - this is often done manually. The third challenge comes from the fact that determining the root-cause of anomalies in such settings is akin to finding a needle in a haystack. Identifying (in real time) a time-series dataset that is associated causally with the anomalous time-series data is a very difficult problem. In this paper, we describe a unified framework that addresses these challenges. Reasoning based Anomaly Detection Framework (RADF) is designed to perform real time anomaly detection on very large datasets. This framework employs a novel technique (mSelect) that automates the process of algorithm selection and hyper-parameter tuning for each use case. Finally, it incorporates a post-detection capability that allows for faster triaging and root-cause determination. Our extensive experiments demonstrate that RADF, powered by mSelect, surpasses state-of-the-art anomaly detection models in AUC performance for 5 out of 9 public benchmarking datasets. RADF achieved an AUC of over 0.85 for 7 out of 9 datasets, a distinction unmatched by any other state-of-the-art model.

[744] Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Finite-Horizon Offline RL with Linear $q^π$-Realizability and Concentrability

Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan

Main category: cs.LG

TL;DR: This paper presents a statistically efficient learner for offline policy evaluation in finite-horizon RL with function approximation, improving upon prior impossibility results by assuming trajectory data and qπ-realizability.

DetailsMotivation: Prior work showed that statistically efficient learning is impossible for offline RL with only concentrability and qπ-realizability assumptions. Recent work achieved efficient policy optimization with trajectory data, but policy evaluation remained unaddressed.

Method: The authors develop a statistically efficient learner for policy evaluation under the same assumptions as Tkachuk et al. (2024) - trajectory data, concentrability, and qπ-realizability. They also provide a tighter analysis to improve sample complexity bounds for policy optimization.

Result: The paper successfully establishes a statistically efficient learner for policy evaluation in the finite-horizon offline RL setting with function approximation, and demonstrates improved sample complexity for policy optimization through tighter analysis.

Conclusion: This work resolves the policy evaluation problem in offline RL by leveraging trajectory data assumptions, complementing recent advances in policy optimization and providing improved theoretical guarantees.

Abstract: We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^\pi$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.

[745] D2 Actor Critic: Diffusion Actor Meets Distributional Critic

Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C Stadie

Main category: cs.LG

TL;DR: D2AC is a model-free RL algorithm that trains diffusion policies online using a stable policy improvement objective and a robust distributional critic, achieving SOTA performance on 18 hard RL tasks.

DetailsMotivation: To address the high variance of typical policy gradients and complexity of backpropagation through time in training expressive diffusion policies for RL.

Method: Uses a policy improvement objective that avoids high variance gradients, combined with a robust distributional critic based on distributional RL and clipped double Q-learning.

Result: Achieved state-of-the-art performance on 18 challenging RL tasks including Humanoid, Dog, and Shadow Hand domains, covering both dense-reward and goal-conditioned scenarios.

Conclusion: D2AC is highly effective for training diffusion policies online and demonstrates strong behavioral robustness and generalization capacity in complex RL environments.

Abstract: We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach.

[746] Task-Level Contrastiveness for Cross-Domain Few-Shot Learning

Kristi Topollai, Anna Choromanska

Main category: cs.LG

TL;DR: The paper introduces task-level contrastiveness to improve few-shot classification and meta-learning by enabling better generalization across diverse domains through unsupervised clustering of task representations.

DetailsMotivation: Existing few-shot classification and meta-learning methods struggle with generalization across diverse domains, suffer from low accuracy, high computational costs, and rely on restrictive assumptions.

Method: Proposes task-level contrastive learning with task augmentations and a contrastive loss that encourages unsupervised clustering of task representations, which can be integrated into existing few-shot/meta-learning algorithms.

Result: The method achieves superior performance on the MetaDataset benchmark, showing improved generalization and computational efficiency without requiring prior domain knowledge.

Conclusion: Task-level contrastiveness provides a lightweight, effective solution for cross-domain generalization in few-shot learning, delivering significant benefits without additional complexity.

Abstract: Few-shot classification and meta-learning methods typically struggle to generalize across diverse domains, as most approaches focus on a single dataset, failing to transfer knowledge across various seen and unseen domains. Existing solutions often suffer from low accuracy, high computational costs, and rely on restrictive assumptions. In this paper, we introduce the notion of task-level contrastiveness, a novel approach designed to address issues of existing methods. We start by introducing simple ways to define task augmentations, and thereafter define a task-level contrastive loss that encourages unsupervised clustering of task representations. Our method is lightweight and can be easily integrated within existing few-shot/meta-learning algorithms while providing significant benefits. Crucially, it leads to improved generalization and computational efficiency without requiring prior knowledge of task domains. We demonstrate the effectiveness of our approach through different experiments on the MetaDataset benchmark, where it achieves superior performance without additional complexity.

[747] A Lightweight Federated Learning Approach for Privacy-Preserving Botnet Detection in IoT

Taha M. Mahmoud, Naima Kaabouch

Main category: cs.LG

TL;DR: A lightweight, privacy-preserving IoT botnet detection framework using federated learning that enables collaborative model training without sharing raw data, achieving high accuracy with reduced communication costs.

DetailsMotivation: The rapid IoT growth increases botnet attack risks, while conventional detection methods struggle with scalability, privacy, and adaptability in resource-constrained IoT environments.

Method: Federated learning-based framework with distributed devices collaboratively training models without raw data exchange, using communication-efficient aggregation to reduce overhead.

Result: Experiments on benchmark IoT botnet datasets show high detection accuracy while substantially reducing communication costs.

Conclusion: Federated learning provides a practical path toward scalable, secure, and privacy-aware intrusion detection for IoT ecosystems.

Abstract: The rapid growth of the Internet of Things (IoT) has expanded opportunities for innovation but also increased exposure to botnet-driven cyberattacks. Conventional detection methods often struggle with scalability, privacy, and adaptability in resource-constrained IoT environments. To address these challenges, we present a lightweight and privacy-preserving botnet detection framework based on federated learning. This approach enables distributed devices to collaboratively train models without exchanging raw data, thus maintaining user privacy while preserving detection accuracy. A communication-efficient aggregation strategy is introduced to reduce overhead, ensuring suitability for constrained IoT networks. Experiments on benchmark IoT botnet datasets demonstrate that the framework achieves high detection accuracy while substantially reducing communication costs. These findings highlight federated learning as a practical path toward scalable, secure, and privacy-aware intrusion detection for IoT ecosystems.

[748] RAPID: An Efficient Reinforcement Learning Algorithm for Small Language Models

Lianghuan Huang, Sagnik Anupam, Insup Lee, Shuo Li, Osbert Bastani

Main category: cs.LG

TL;DR: RAPID is a novel RL algorithm that reduces training time by 11%-34% for finetuning small language models through large-batch inference and off-policy policy gradient updates with group advantage estimation.

DetailsMotivation: RL algorithms are resource-intensive and time-consuming for finetuning small language models on tasks like math and coding, due to the computational costs of both inference and backpropagation during training.

Method: RAPID performs inference in large batches to maximize computational resource usage, then conducts off-policy policy gradient updates in mini-batches using group advantage estimation and importance weighted estimators to correct for off-policy bias.

Result: Experiments show RAPID reduces running time by 11%-34% on three benchmarks compared to state-of-the-art RL algorithms while maintaining similar or better accuracy.

Conclusion: RAPID successfully addresses the computational inefficiency of RL training for language models through optimized batch processing and off-policy learning techniques, achieving significant time savings without sacrificing performance.

Abstract: Reinforcement learning (RL) has emerged as a promising strategy for finetuning small language models (SLMs) to solve targeted tasks such as math and coding. However, RL algorithms tend to be resource-intensive, taking a significant amount of time to train. We propose RAPID, a novel RL algorithm that can substantially reduce the running time of RL. Our key insight is that RL tends to be costly due to the need to perform both inference and backpropagation during training. To maximize use of computational resources, our algorithm performs inference in large batches, and then performs off-policy policy gradient updates in mini-batches. For off-policy updates, we incorporate group advantage estimation into the policy gradient algorithm, and derive an importance weighted estimator to correct for the bias arising from off-policy learning. Our experiments demonstrate that our algorithm can reduce running time by 11%-34% on three benchmarks compared to state-of-the-art RL algorithms while maintaining similar or better accuracy.

[749] Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh

Main category: cs.LG

TL;DR: CS-RLHF introduces a certifiable safety framework for LLMs using a rectified penalty-based approach with semantically grounded safety scores, eliminating dual-variable updates and providing provable safety guarantees.

DetailsMotivation: Current CMDP-based safety methods have limitations: sensitivity to reward/cost functions and computational expense of dual-variable tuning without provable safety guarantees against adversarial jailbreaks.

Method: Uses a cost model trained on large-scale corpus for semantically grounded safety scores, adopts rectified penalty-based formulation based on exact penalty functions theory, eliminating dual-variable updates.

Result: Outperforms state-of-the-art LLM responses, achieving at least 5 times efficiency against both nominal and jail-breaking prompts.

Conclusion: CS-RLHF provides a more effective and certifiable safety framework for LLMs with provable safety guarantees and improved efficiency.

Abstract: Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning dual-variable, a process that is both computationally expensive and does not provide any provable safety guarantee for a fixed dual variable that can be exploitable through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF) that introduces a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLM model responses rendering at-least 5 times efficient against nominal and jail-breaking prompts

[750] Sequential decoder training for improved latent space dynamics identification

William Anderson, Seung Whan Chung, Youngsoo Choi

Main category: cs.LG

TL;DR: Multi-stage LaSDI (mLaSDI) improves reduced-order modeling by sequentially learning decoders to correct residual errors, achieving better accuracy and faster training than standard LaSDI for the Vlasov equation.

DetailsMotivation: Standard LaSDI can compromise reconstruction accuracy when enforcing latent dynamics during training, creating a need for improved methods that maintain accuracy while learning interpretable latent dynamics.

Method: Multi-stage framework that sequentially learns additional decoders to correct residual errors from previous stages, building upon the LaSDI approach that combines autoencoders with equation discovery.

Result: mLaSDI consistently outperforms standard LaSDI on the 1D-1V Vlasov equation, achieving lower prediction errors and reduced training time across various architectures.

Conclusion: The multi-stage approach successfully improves both reconstruction and prediction accuracy while maintaining the interpretable latent dynamics framework of LaSDI.

Abstract: Accurate numerical solutions of partial differential equations are essential in many scientific fields but often require computationally expensive solvers, motivating reduced-order models (ROMs). Latent Space Dynamics Identification (LaSDI) is a data-driven ROM framework that combines autoencoders with equation discovery to learn interpretable latent dynamics. However, enforcing latent dynamics during training can compromise reconstruction accuracy of the model for simulation data. We introduce multi-stage LaSDI (mLaSDI), a framework that improves reconstruction and prediction accuracy by sequentially learning additional decoders to correct residual errors from previous stages. Applied to the 1D-1V Vlasov equation, mLaSDI consistently outperforms standard LaSDI, achieving lower prediction errors and reduced training time across a wide range of architectures.

[751] CrossLag: Predicting Major Dengue Outbreaks with a Domain Knowledge Informed Transformer

Ashwin Prabu, Nhat Thanh Tran, Guofa Zhou, Jack Xin

Main category: cs.LG

TL;DR: CrossLag introduces environmentally informed attention to incorporate lagging endogenous signals from exogenous climate data into transformer architecture, improving major dengue outbreak prediction.

DetailsMotivation: Existing models struggle to predict major dengue outbreaks that require timely public warnings, despite various forecasting approaches being developed.

Method: Uses CrossLag attention mechanism that incorporates lagging endogenous signals from exogenous climate data into TimeXer transformer architecture with low parameter counts.

Result: Outperforms TimeXer baseline by considerable margin in detecting and predicting major outbreaks in Singapore dengue data over 24-week prediction window.

Conclusion: CrossLag effectively captures outbreak-lagging relationships with climate anomalies, enabling better prediction of major dengue outbreaks that need timely warnings.

Abstract: A variety of models have been developed to forecast dengue cases to date. However, it remains a challenge to predict major dengue outbreaks that need timely public warnings the most. In this paper, we introduce CrossLag, an environmentally informed attention that allows for the incorporation of lagging endogenous signals behind the significant events in the exogenous data into the architecture of the transformer at low parameter counts. Outbreaks typically lag behind major changes in climate and oceanic anomalies. We use TimeXer, a recent general-purpose transformer distinguishing exogenous-endogenous inputs, as the baseline for this study. Our proposed model outperforms TimeXer by a considerable margin in detecting and predicting major outbreaks in Singapore dengue data over a 24-week prediction window.

[752] Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

Fatmazohra Rezkellah, Ramzi Dakhmouche

Main category: cs.LG

TL;DR: This paper proposes a unified constrained optimization approach for LLM safety that addresses both sensitive information unlearning and jail-breaking robustness through minimal weight interventions.

DetailsMotivation: With increasing LLM adoption, there's growing need for privacy-preserving and safe generation, specifically addressing unlearning of sensitive information and robustness to jail-breaking attacks.

Method: Proposes constrained optimization formulations that find smallest interventions on LLM weights to either make given vocabulary unreachable or embed robustness by shifting weights to safer regions, without requiring oracle classifiers.

Result: The simple point-wise constraint-based intervention outperforms max-min interventions with lower computational cost, and shows superior performance compared to state-of-the-art defense methods.

Conclusion: The unified approach effectively addresses both unlearning and robustness requirements through efficient weight interventions, offering a practical solution for LLM safety without computational overhead of oracle classifiers.

Abstract: With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a \emph{unified manner}, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a \emph{safer} region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn’t require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.

[753] Longitudinal Flow Matching for Trajectory Modeling

Mohammad Mohaiminul Islam, Thijs P. Kuipers, Sharvaree Vadgama, Coen de Vente, Afsana Khan, Clara I. Sánchez, Erik J. Bekkers

Main category: cs.LG

TL;DR: IMMFM learns continuous stochastic dynamics from sparsely sampled high-dimensional trajectories using piecewise-quadratic interpolation paths and jointly optimized drift and diffusion coefficients.

DetailsMotivation: Existing generative models struggle with sparsely sampled high-dimensional trajectories and often reduce dynamics learning to pairwise transitions, failing to capture continuous stochastic processes.

Method: Uses piecewise-quadratic interpolation paths as smooth targets for flow matching, jointly optimizes drift and data-driven diffusion coefficients with theoretical stability guarantees.

Result: Outperforms existing methods in forecasting accuracy and downstream tasks on synthetic benchmarks and real-world longitudinal neuroimaging datasets.

Conclusion: IMMFM effectively captures intrinsic stochasticity, handles irregular sparse sampling, and generates subject-specific trajectories for sequential data.

Abstract: Generative models for sequential data often struggle with sparsely sampled and high-dimensional trajectories, typically reducing the learning of dynamics to pairwise transitions. We propose \textit{Interpolative Multi-Marginal Flow Matching} (IMMFM), a framework that learns continuous stochastic dynamics jointly consistent with multiple observed time points. IMMFM employs a piecewise-quadratic interpolation path as a smooth target for flow matching and jointly optimizes drift and a data-driven diffusion coefficient, supported by a theoretical condition for stable learning. This design captures intrinsic stochasticity, handles irregular sparse sampling, and yields subject-specific trajectories. Experiments on synthetic benchmarks and real-world longitudinal neuroimaging datasets show that IMMFM outperforms existing methods in both forecasting accuracy and further downstream tasks.

[754] Generalization of Graph Neural Network Models for Distribution Grid Fault Detection

Burak Karabulut, Carlo Manna, Chris Develder

Main category: cs.LG

TL;DR: This paper benchmarks various Graph Neural Network architectures in RNN+GNN pipelines for power grid fault detection, showing RGATv2 has superior generalization across different grid topologies.

DetailsMotivation: Fault detection in power grids needs robustness to evolving topologies from reconfigurations and DER integration. Current RGNN methods use basic GCNs, but more advanced GNN architectures exist that haven't been explored for this domain.

Method: Systematic benchmarking of GNN architectures (GraphSAGE, GAT, GATv2) in RNN+GNN pipelines against existing RGCN and pure RNN models, with focus on generalization across different topology settings.

Result: RGATv2 showed superior generalization with only ~12% F1-score reduction across topologies, while pure RNN models failed (~60% reduction) and other RGNN variants degraded up to ~25% lower F1-scores.

Conclusion: RGATv2 in RNN+GNN pipelines provides the most robust fault detection solution for evolving power grid topologies, significantly outperforming existing methods in generalization capability.

Abstract: Fault detection in power distribution grids is critical for ensuring system reliability and preventing costly outages. Moreover, fault detection methodologies should remain robust to evolving grid topologies caused by factors such as reconfigurations, equipment failures, and Distributed Energy Resource (DER) integration. Current data-driven state-of-the-art methods use Recurrent Neural Networks (RNNs) for temporal modeling and Graph Neural Networks (GNNs) for spatial learning, in an RNN+GNN pipeline setting (RGNN in short). Specifically, for power system fault diagnosis, Graph Convolutional Networks (GCNs) have been adopted. Yet, various more advanced GNN architectures have been proposed and adopted in domains outside of power systems. In this paper, we set out to systematically and consistently benchmark various GNN architectures in an RNN+GNN pipeline model. Specifically, to the best of our knowledge, we are the first to (i) propose to use GraphSAGE and Graph Attention (GAT, GATv2) in an RGNN for fault diagnosis, and (ii) provide a comprehensive benchmark against earlier proposed RGNN solutions (RGCN) as well as pure RNN models (especially Gated Recurrent Unit (GRU)), particularly (iii) exploring their generalization potential for deployment in different settings than those used for training them. Our experimental results on the IEEE 123-node distribution network show that RGATv2 has superior generalization capabilities, maintaining high performance with an F1-score reduction of $\sim$12% across different topology settings. In contrast, pure RNN models largely fail, experiencing an F1-score reduction of up to $\sim$60%, while other RGNN variants also exhibit significant performance degradation, i.e., up to $\sim$25% lower F1-scores.

[755] Efficient Test-Time Scaling for Small Vision-Language Models

Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos

Main category: cs.LG

TL;DR: The paper proposes two efficient test-time scaling strategies (TTAug and TTAdapt) for small vision-language models to improve performance without compromising computational efficiency.

DetailsMotivation: Small VLMs are computationally efficient but suffer from weaker generalization and task performance. Existing test-time scaling methods are too computationally demanding, contradicting the resource-efficient goals of small models.

Method: Two strategies: (1) TTAug - generates multiple augmented inputs and aggregates outputs at token level without parameter updates, (2) TTAdapt - adapts model parameters during inference using consensus-based pseudolabels from TTAug.

Result: Consistent performance improvements across nine benchmarks while maintaining computational efficiency suitable for resource-constrained environments. The approach works across different model scales and VLMs without additional tuning.

Conclusion: The proposed test-time scaling strategies effectively enhance small VLM performance while preserving computational efficiency, demonstrating broad applicability across different model scales and architectures.

Abstract: Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.

[756] BEKAN: Boundary condition-guaranteed evolutionary Kolmogorov-Arnold networks with radial basis functions for solving PDE problems

Bongseok Kim, Jiahao Zhang, Guang Lin

Main category: cs.LG

TL;DR: BEKAN is a boundary condition-guaranteed evolutionary KAN that uses radial basis functions to enforce Dirichlet, periodic, and Neumann boundary conditions, outperforming MLP and B-splines KAN in PDE simulations.

DetailsMotivation: Deep learning struggles with precise boundary condition enforcement in PDE solving due to neural networks' black-box nature, requiring methods that can rigorously incorporate boundary constraints.

Method: Three approaches: Gaussian RBFs for Dirichlet problems, periodic layers with sinusoidal functions for periodic problems, and least-squares formulation for Neumann problems, all integrated into an evolutionary KAN framework.

Result: BEKAN achieves higher accuracy than multilayer perceptron and B-splines KAN across Dirichlet, Neumann, periodic, and mixed boundary value problems in extensive numerical experiments.

Conclusion: BEKAN enhances KANs’ capability for solving PDEs with boundary condition satisfaction, advancing scientific computing and engineering applications.

Abstract: Deep learning has gained attention for solving PDEs, but the black-box nature of neural networks hinders precise enforcement of boundary conditions. To address this, we propose a boundary condition-guaranteed evolutionary Kolmogorov-Arnold Network (KAN) with radial basis functions (BEKAN). In BEKAN, we propose three distinct and combinable approaches for incorporating Dirichlet, periodic, and Neumann boundary conditions into the network. For Dirichlet problem, we use smooth and global Gaussian RBFs to construct univariate basis functions for approximating the solution and to encode boundary information at the activation level of the network. To handle periodic problems, we employ a periodic layer constructed from a set of sinusoidal functions to enforce the boundary conditions exactly. For a Neumann problem, we devise a least-squares formulation to guide the parameter evolution toward satisfying the Neumann condition. By virtue of the boundary-embedded RBFs, the periodic layer, and the evolutionary framework, we can perform accurate PDE simulations while rigorously enforcing boundary conditions. For demonstration, we conducted extensive numerical experiments on Dirichlet, Neumann, periodic, and mixed boundary value problems. The results indicate that BEKAN outperforms both multilayer perceptron (MLP) and B-splines KAN in terms of accuracy. In conclusion, the proposed approach enhances the capability of KANs in solving PDE problems while satisfying boundary conditions, thereby facilitating advancements in scientific computing and engineering applications.

[757] Latent Mixture of Symmetries for Sample-Efficient Dynamic Learning

Haoran Li, Chenhan Xiao, Muhao Guo, Yang Weng

Main category: cs.LG

TL;DR: Latent Mixture of Symmetries (Latent MoS) is a model that captures mixtures of symmetry-governed latent factors from dynamical measurements, improving sample efficiency in learning dynamics for control and RL applications.

DetailsMotivation: Limited system measurements from low-resolution sensors demand sample-efficient learning. Existing methods assume single global symmetry groups and separate symmetry discovery from dynamic learning, leading to limited expressiveness and error accumulation.

Method: Proposes Latent MoS model that captures mixture of symmetry-governed latent factors while preserving underlying symmetric transformations locally and provably. Uses hierarchical architecture with stacked MoS blocks to capture long-term equivariance.

Result: Numerical experiments in diverse physical systems show Latent MoS outperforms state-of-the-art baselines in interpolation and extrapolation tasks while providing interpretable latent representations.

Conclusion: Latent MoS offers improved sample efficiency and expressiveness for learning dynamics in engineering systems, with interpretable representations suitable for geometric and safety-critical analyses.

Abstract: Learning dynamics is essential for model-based control and Reinforcement Learning in engineering systems, such as robotics and power systems. However, limited system measurements, such as those from low-resolution sensors, demand sample-efficient learning. Symmetry provides a powerful inductive bias by characterizing equivariant relations in system states to improve sample efficiency. While recent methods attempt to discover symmetries from data, they typically assume a single global symmetry group and treat symmetry discovery and dynamic learning as separate tasks, leading to limited expressiveness and error accumulation. In this paper, we propose the Latent Mixture of Symmetries (Latent MoS), an expressive model that captures a mixture of symmetry-governed latent factors from complex dynamical measurements. Latent MoS focuses on dynamic learning while locally and provably preserving the underlying symmetric transformations. To further capture long-term equivariance, we introduce a hierarchical architecture that stacks MoS blocks. Numerical experiments in diverse physical systems demonstrate that Latent MoS outperforms state-of-the-art baselines in interpolation and extrapolation tasks while offering interpretable latent representations suitable for future geometric and safety-critical analyses.

[758] FieldFormer: Physics-Informed Transformers for Spatio-Temporal Field Reconstruction from Sparse Sensors

Ankit Bhardwaj, Ananth Balashankar, Lakshminarayanan Subramanian

Main category: cs.LG

TL;DR: FieldFormer is a transformer-based framework for mesh-free spatio-temporal field reconstruction that combines data-driven learning with physics-based structure, achieving over 40% improvement over baselines on sparse, noisy data.

DetailsMotivation: Spatio-temporal sensor data is often sparse, noisy, and irregular, and existing methods struggle because they either ignore governing PDEs or don't scale well.

Method: Uses a transformer-based framework with learnable velocity-scaled distance metric to gather local neighborhoods, refined via expectation-maximization style updates, with physics consistency enforced through autograd-based PDE residuals and boundary penalties.

Result: Outperforms strong baselines by more than 40% across three benchmarks (scalar anisotropic heat equation, vector-valued shallow-water system, realistic advection-diffusion pollution simulation), achieving RMSE < 10^-2 with sparse (0.4%-2%) and noisy (10%) data.

Conclusion: FieldFormer enables accurate, efficient, and physically consistent field reconstruction from sparse and noisy spatio-temporal data.

Abstract: Spatio-temporal sensor data is often sparse, noisy, and irregular, and existing interpolation or learning methods struggle here because they either ignore governing PDEs or do not scale. We introduce FieldFormer, a transformer-based framework for mesh-free spatio-temporal field reconstruction that combines data-driven flexibility with physics-based structure. For each query, FieldFormer gathers a local neighborhood using a learnable velocity-scaled distance metric, enabling anisotropic adaptation to different propagation regimes. Neighborhoods are built efficiently via per-batch offset recomputation, and refined in an expectation-maximization style as the velocity scales evolve. Predictions are made by a local transformer encoder, and physics consistency is enforced through autograd-based PDE residuals and boundary-specific penalties. Across three benchmarks–a scalar anisotropic heat equation, a vector-valued shallow-water system, and a realistic advection-diffusion pollution simulation–FieldFormer consistently outperforms strong baselines by more than 40%. Our results demonstrate that FieldFormer enables accurate (RMSE$<10^{-2}$), efficient, and physically consistent field reconstruction from sparse (0.4%-2%) and noisy(10%) data.

[759] MECKD: Deep Learning-Based Fall Detection in Multilayer Mobile Edge Computing With Knowledge Distillation

Wei-Lung Mao, Chun-Chi Wang, Po-Heng Chou, Kai-Chun Liu, Yu Tsao

Main category: cs.LG

TL;DR: Proposed a multilayer mobile edge computing (MLMEC) framework with knowledge distillation for fall detection systems to balance accuracy and latency by distributing computation across edge devices and servers.

DetailsMotivation: Address challenges in fall detection systems including limited edge device model size, data transmission latency to cloud centers, and the need for real-time processing for aging population assistance.

Method: MLMEC splits architecture into stations with neural network models, uses knowledge distillation to improve front-end detection accuracy by leveraging high-power back-end stations, and transmits data to more robust stations when front-end detection is unreliable.

Result: Knowledge distillation improved accuracy by 11.65% on SisFall dataset and 2.78% on FallAllD dataset. MLMEC with KD reduced data latency rate by 54.15% on FallAllD and 46.67% on SisFall compared to MLMEC without KD.

Conclusion: The MLMEC fall detection system demonstrates improved accuracy and reduced latency, making it effective for real-time fall detection applications.

Abstract: The rising aging population has increased the importance of fall detection (FD) systems as an assistive technology, where deep learning techniques are widely applied to enhance accuracy. FD systems typically use edge devices (EDs) worn by individuals to collect real-time data, which are transmitted to a cloud center (CC) or processed locally. However, this architecture faces challenges such as a limited ED model size and data transmission latency to the CC. Mobile edge computing (MEC), which allows computations at MEC servers deployed between EDs and CC, has been explored to address these challenges. We propose a multilayer MEC (MLMEC) framework to balance accuracy and latency. The MLMEC splits the architecture into stations, each with a neural network model. If front-end equipment cannot detect falls reliably, data are transmitted to a station with more robust back-end computing. The knowledge distillation (KD) approach was employed to improve front-end detection accuracy by allowing high-power back-end stations to provide additional learning experiences, enhancing precision while reducing latency and processing loads. Simulation results demonstrate that the KD approach improved accuracy by 11.65% on the SisFall dataset and 2.78% on the FallAllD dataset. The MLMEC with KD also reduced the data latency rate by 54.15% on the FallAllD dataset and 46.67% on the SisFall dataset compared to the MLMEC without KD. In summary, the MLMEC FD system exhibits improved accuracy and reduced latency.

[760] Deep Domain Adaptation for Turbofan Engine Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends

Yucheng Wang, Mohamed Ragab, Yubo Hou, Zhenghua Chen, Min Wu, Xiaoli Li

Main category: cs.LG

TL;DR: This paper provides a comprehensive review of Domain Adaptation techniques for turbofan engine RUL prediction, introducing a novel taxonomy and evaluating methods to address challenges like limited data and distribution shifts.

DetailsMotivation: Turbofan engine RUL prediction is crucial for aviation safety but faces challenges with limited data and distribution shifts from varying operating conditions. Domain Adaptation offers a promising solution for knowledge transfer between domains.

Method: The paper introduces a novel taxonomy for DA techniques in turbofan engines organized into: methodology-based (how DA is applied), alignment-based (where distribution shifts occur), and problem-based (why adaptations are needed). It also evaluates selected DA techniques on turbofan datasets.

Result: The review provides a multidimensional view of DA approaches tailored to turbofan engine characteristics, offering practical insights for practitioners and identifying key challenges in the field.

Conclusion: The paper establishes a comprehensive framework for understanding DA in turbofan RUL prediction and identifies future research directions to advance more effective domain adaptation techniques for this critical application.

Abstract: Remaining Useful Life (RUL) prediction for turbofan engines plays a vital role in predictive maintenance, ensuring operational safety and efficiency in aviation. Although data-driven approaches using machine learning and deep learning have shown potential, they face challenges such as limited data and distribution shifts caused by varying operating conditions. Domain Adaptation (DA) has emerged as a promising solution, enabling knowledge transfer from source domains with abundant data to target domains with scarce data while mitigating distributional shifts. Given the unique properties of turbofan engines, such as complex operating conditions, high-dimensional sensor data, and slower-changing signals, it is essential to conduct a focused review of DA techniques specifically tailored to turbofan engines. To address this need, this paper provides a comprehensive review of DA solutions for turbofan engine RUL prediction, analyzing key methodologies, challenges, and recent advancements. A novel taxonomy tailored to turbofan engines is introduced, organizing approaches into methodology-based (how DA is applied), alignment-based (where distributional shifts occur due to operational variations), and problem-based (why certain adaptations are needed to address specific challenges). This taxonomy offers a multidimensional view that goes beyond traditional classifications by accounting for the distinctive characteristics of turbofan engine data and the standard process of applying DA techniques to this area. Additionally, we evaluate selected DA techniques on turbofan engine datasets, providing practical insights for practitioners and identifying key challenges. Future research directions are identified to guide the development of more effective DA techniques, advancing the state of RUL prediction for turbofan engines.

[761] Explore the Loss space with Hill-ADAM

Meenakshi Manikandan, Leilani Gilpin

Main category: cs.LG

TL;DR: Hill-ADAM is a deterministic optimizer designed to escape local minima by alternating between error minimization and maximization, enabling global minimum discovery.

DetailsMotivation: To overcome ADAM optimizer's limitations in escaping local minima due to stochastic gradient updates that often converge at the first minimum encountered.

Method: Derives analytical approximation of ADAM step size, identifies escape conditions, and implements alternating minimization-maximization cycles for deterministic exploration of loss space.

Result: Tested on 5 loss functions and 12 image color correction instances, demonstrating effective escape from local minima and global minimum discovery.

Conclusion: Hill-ADAM provides a deterministic approach to escape local minima and find global minima, addressing fundamental limitations of stochastic optimization methods like ADAM.

Abstract: This paper introduces Hill-ADAM. Hill-ADAM is an optimizer with its focus towards escaping local minima in prescribed loss landscapes to find the global minimum. Hill-ADAM escapes minima by deterministically exploring the state space. This eliminates uncertainty from random gradient updates in stochastic algorithms while seldom converging at the first minimum that visits. In the paper we first derive an analytical approximation of the ADAM Optimizer step size at a particular model state. From there define the primary condition determining ADAM limitations in escaping local minima. The proposed optimizer algorithm Hill-ADAM alternates between error minimization and maximization. It maximizes to escape the local minimum and minimizes again afterward. This alternation provides an overall exploration throughout the loss space. This allows the deduction of the global minimum’s state. Hill-ADAM was tested with 5 loss functions and 12 amber-saturated to cooler-shade image color correction instances.

[762] Neural Bayesian Filtering

Christopher Solinas, Radovan Haluska, David Sychrovsky, Finbarr Timbers, Nolan Bard, Michael Buro, Martin Schmid, Nathan R. Sturtevant, Michael Bowling

Main category: cs.LG

TL;DR: Neural Bayesian Filtering (NBF) is a method that maintains belief distributions in partially observable systems using latent embeddings and particle-style updates, combining classical filter efficiency with deep generative model expressiveness.

DetailsMotivation: To address the challenge of maintaining accurate belief distributions in partially observable systems while avoiding particle impoverishment and tracking multimodal beliefs efficiently.

Method: NBF learns latent representations of beliefs as embedding vectors, uses these embeddings to condition generative models for sampling, and performs particle-style updates in embedding space using observations and environment dynamics.

Result: NBF successfully tracks rapidly shifting, multimodal beliefs in state estimation tasks across three partially observable environments, mitigating particle impoverishment risk.

Conclusion: NBF effectively combines computational efficiency of classical filters with the expressiveness of deep generative models for robust belief tracking in partially observable systems.

Abstract: We present Neural Bayesian Filtering (NBF), an algorithm for maintaining distributions over hidden states, called beliefs, in partially observable systems. NBF is trained to find a good latent representation of the beliefs induced by a task. It maps beliefs to fixed-length embedding vectors, which condition generative models for sampling. During filtering, particle-style updates compute posteriors in this embedding space using incoming observations and the environment’s dynamics. NBF combines the computational efficiency of classical filters with the expressiveness of deep generative models - tracking rapidly shifting, multimodal beliefs while mitigating the risk of particle impoverishment. We validate NBF in state estimation tasks in three partially observable environments.

[763] Predicting Stock Price Movement with LLM-Enhanced Tweet Emotion Analysis

An Vuong, Susan Gauch

Main category: cs.LG

TL;DR: A deep learning framework that combines emotion features from tweets (enhanced by Llama 3.1-8B-Instruct) with historical stock prices using LSTM to predict next-day significant price movements, achieving up to 38.5% accuracy.

DetailsMotivation: Stock price prediction is challenging due to market volatility and investor sentiment. This paper aims to improve prediction accuracy by incorporating emotion analysis from social media data.

Method: Use Llama 3.1-8B-Instruct to preprocess tweet data, then extract emotion features using three methods: DistilRoBERTa classifier and two NRC lexicon-based approaches. Combine these features with historical stock prices to train an LSTM model.

Result: All three emotion analysis methods improved prediction accuracy compared to baseline (13.5%). DistilRoBERTa-based model performed best, with accuracy increasing from 23.6% to 38.5% when using LLaMA-enhanced emotion analysis on TSLA, AAPL, and AMZN stocks.

Conclusion: Using large language models to preprocess tweet content enhances emotion analysis effectiveness, which in turn improves the accuracy of predicting significant stock price movements.

Abstract: Accurately predicting short-term stock price movement remains a challenging task due to the market’s inherent volatility and sensitivity to investor sentiment. This paper discusses a deep learning framework that integrates emotion features extracted from tweet data with historical stock price information to forecast significant price changes on the following day. We utilize Meta’s Llama 3.1-8B-Instruct model to preprocess tweet data, thereby enhancing the quality of emotion features derived from three emotion analysis approaches: a transformer-based DistilRoBERTa classifier from the Hugging Face library and two lexicon-based methods using National Research Council Canada (NRC) resources. These features are combined with previous-day stock price data to train a Long Short-Term Memory (LSTM) model. Experimental results on TSLA, AAPL, and AMZN stocks show that all three emotion analysis methods improve the average accuracy for predicting significant price movements, compared to the baseline model using only historical stock prices, which yields an accuracy of 13.5%. The DistilRoBERTa-based stock prediction model achieves the best performance, with accuracy rising from 23.6% to 38.5% when using LLaMA-enhanced emotion analysis. These results demonstrate that using large language models to preprocess tweet content enhances the effectiveness of emotion analysis which in turn improves the accuracy of predicting significant stock price movements.

[764] From Theory to Practice: Evaluating Data Poisoning Attacks and Defenses in In-Context Learning on Social Media Health Discourse

Rabeya Amin Jhuma, Mostafa Mohaimen Akand Faisal

Main category: cs.LG

TL;DR: This study demonstrates that in-context learning in LLMs is vulnerable to data poisoning attacks in public health sentiment analysis, with minor perturbations causing up to 67% label flips, but spectral signature defense can effectively mitigate these attacks.

DetailsMotivation: To explore the practical vulnerabilities of in-context learning in real-world public health settings and extend theoretical poisoning studies to high-stakes applications like health discourse analysis.

Method: Used data poisoning attacks on HMPV tweets through synonym replacement, negation insertion, and randomized perturbations, then applied Spectral Signature Defense to filter poisoned examples.

Result: Minor manipulations caused major disruptions (67% sentiment label flips), but after defense, ICL accuracy stabilized at 46.7% and logistic regression validation achieved 100% accuracy.

Conclusion: ICL is fragile under poisoning attacks but spectral defenses can effectively preserve dataset integrity, making AI systems more reliable for health-related social media monitoring.

Abstract: This study explored how in-context learning (ICL) in large language models can be disrupted by data poisoning attacks in the setting of public health sentiment analysis. Using tweets of Human Metapneumovirus (HMPV), small adversarial perturbations such as synonym replacement, negation insertion, and randomized perturbation were introduced into the support examples. Even these minor manipulations caused major disruptions, with sentiment labels flipping in up to 67% of cases. To address this, a Spectral Signature Defense was applied, which filtered out poisoned examples while keeping the data’s meaning and sentiment intact. After defense, ICL accuracy remained steady at around 46.7%, and logistic regression validation reached 100% accuracy, showing that the defense successfully preserved the dataset’s integrity. Overall, the findings extend prior theoretical studies of ICL poisoning to a practical, high-stakes setting in public health discourse analysis, highlighting both the risks and potential defenses for robust LLM deployment. This study also highlights the fragility of ICL under attack and the value of spectral defenses in making AI systems more reliable for health-related social media monitoring.

[765] Implicit Models: Expressive Power Scales with Test-Time Compute

Jialin Liu, Lisang Ding, Stanley Osher, Wotao Yin

Main category: cs.LG

TL;DR: Implicit models use iterative computation with fixed parameters to achieve infinite-depth networks with constant memory training. The paper shows these models can match larger explicit networks by scaling expressive power with test-time compute.

DetailsMotivation: To understand why compact implicit models can match or exceed larger explicit networks when given more test-time compute, despite their parameter efficiency.

Method: Nonparametric analysis of expressive power, proving that simple implicit operators can progressively express more complex mappings through iteration, with expressive power scaling with test-time compute.

Result: Theoretical characterization shows implicit models can match richer function classes as iterations increase. Validated across image reconstruction, scientific computing, and operations research with improved solution quality and stability.

Conclusion: Implicit models can achieve performance comparable to larger explicit networks by leveraging test-time compute to progressively increase expressive power through iteration.

Abstract: Implicit models, an emerging model class, compute outputs by iterating a single parameter block to a fixed point. This architecture realizes an infinite-depth, weight-tied network that trains with constant memory, significantly reducing memory needs for the same level of performance compared to explicit models. While it is empirically known that these compact models can often match or even exceed larger explicit networks by allocating more test-time compute, the underlying mechanism remains poorly understood. We study this gap through a nonparametric analysis of expressive power. We provide a strict mathematical characterization, showing that a simple and regular implicit operator can, through iteration, progressively express more complex mappings. We prove that for a broad class of implicit models, this process lets the model’s expressive power scale with test-time compute, ultimately matching a much richer function class. The theory is validated across three domains: image reconstruction, scientific computing, and operations research, demonstrating that as test-time iterations increase, the complexity of the learned mapping rises, while the solution quality simultaneously improves and stabilizes.

[766] In-Vivo Training for Deep Brain Stimulation

Nicholas Carter, Arkaprava Gupta, Prateek Ganguli, Benedikt Dietrich, Vibhor Krishna, Samarjit Chakraborty

Main category: cs.LG

TL;DR: RL-based DBS approach using measurable brain activity instead of simulated biomarkers, achieving better PD biomarker suppression than clinical methods.

DetailsMotivation: Current RL-based DBS models rely on biomarkers only available in brain-on-chip simulations, not measurable in real patients.

Method: TD3-based RL agent trained on basal ganglia model, adapting stimulation frequency and amplitude using measurable brain activity.

Result: Greater suppression of PD biomarkers compared to modern clinical DBS implementations, using information measurable in real-world environments.

Conclusion: Enables training personalized RL agents for individual patient needs using clinically measurable data.

Abstract: Deep Brain Stimulation (DBS) is a highly effective treatment for Parkinson’s Disease (PD). Recent research uses reinforcement learning (RL) for DBS, with RL agents modulating the stimulation frequency and amplitude. But, these models rely on biomarkers that are not measurable in patients and are only present in brain-on-chip (BoC) simulations. In this work, we present an RL-based DBS approach that adapts these stimulation parameters according to brain activity measurable in vivo. Using a TD3 based RL agent trained on a model of the basal ganglia region of the brain, we see a greater suppression of biomarkers correlated with PD severity compared to modern clinical DBS implementations. Our agent outperforms the standard clinical approaches in suppressing PD biomarkers while relying on information that can be measured in a real world environment, thereby opening up the possibility of training personalized RL agents specific to individual patient needs.

[767] SAFA-SNN: Sparsity-Aware On-Device Few-Shot Class-Incremental Learning with Fast-Adaptive Structure of Spiking Neural Network

Huijing Zhang, Muyang Cao, Linshan Jiang, Xin Du, Di Yu, Changze Lv, Shuiguang Deng

Main category: cs.LG

TL;DR: SAFA-SNN is a spiking neural network method for on-device few-shot class-incremental learning that uses sparsity-conditioned neuronal dynamics to prevent catastrophic forgetting, zeroth-order optimization for gradient estimation, and subspace projection to enhance discriminability of new classes, achieving superior performance and lower energy consumption.

DetailsMotivation: Edge devices need continuous learning of novel classes while preserving data privacy, but face challenges with insufficient data samples and limited resources. Existing ANN-based FSCIL frameworks are constrained by device resources, while SNNs offer lower energy consumption and neuromorphic hardware compatibility.

Method: Proposes SAFA-SNN with three key components: sparsity-conditioned neuronal dynamics where most neurons remain stable while a subset stays active; zeroth-order optimization to handle spike non-differentiability; and subspace projection during incremental learning to enhance discriminability of new classes.

Result: Extensive experiments on CIFAR100, Mini-ImageNet, and three neuromorphic datasets show SAFA-SNN outperforms baseline methods, achieving at least 4.01% improvement at the last incremental session on Mini-ImageNet and 20% lower energy cost with practical implementation.

Conclusion: SAFA-SNN provides an effective SNN-based solution for on-device FSCIL that mitigates catastrophic forgetting, handles spike non-differentiability, and prevents overfitting to novel classes, demonstrating superior performance and energy efficiency compared to existing methods.

Abstract: Continuous learning of novel classes is crucial for edge devices to preserve data privacy and maintain reliable performance in dynamic environments. However, the scenario becomes particularly challenging when data samples are insufficient, requiring on-device few-shot class-incremental learning (FSCIL) to maintain consistent model performance. Although existing work has explored parameter-efficient FSCIL frameworks based on artificial neural networks (ANNs), their deployment is still fundamentally constrained by limited device resources. Inspired by neural mechanisms, Spiking neural networks (SNNs) process spatiotemporal information efficiently, offering lower energy consumption, greater biological plausibility, and compatibility with neuromorphic hardware than ANNs. In this work, we present an SNN-based method for On-Device FSCIL, i.e., Sparsity-Aware and Fast Adaptive SNN (SAFA-SNN). We first propose sparsity-conditioned neuronal dynamics, in which most neurons remain stable while a subset stays active, thereby mitigating catastrophic forgetting. To further cope with spike non-differentiability in gradient estimation, we employ zeroth-order optimization. Moreover, during incremental learning sessions, we enhance the discriminability of new classes through subspace projection, which alleviates overfitting to novel classes. Extensive experiments conducted on two standard benchmark datasets (CIFAR100 and Mini-ImageNet) and three neuromorphic datasets (CIFAR-10-DVS, DVS128gesture, and N-Caltech101) demonstrate that SAFA-SNN outperforms baseline methods, specifically achieving at least 4.01% improvement at the last incremental session on Mini-ImageNet and 20% lower energy cost over baseline methods with practical implementation.

[768] LLM-Guided Evolutionary Program Synthesis for Quasi-Monte Carlo Design

Amir Sadikov

Main category: cs.LG

TL;DR: Using LLM-guided evolutionary program synthesis to automate the discovery of high-quality quasi-Monte Carlo (QMC) constructions for low-discrepancy point sets and digital sequences.

DetailsMotivation: To solve long-standing QMC design problems by automating the construction of finite point sets with low star discrepancy and optimizing Sobol' direction numbers to minimize randomized QMC error on downstream integrands.

Method: A two-phase procedure combining constructive code proposals with iterative numerical refinement, using an LLM-guided evolutionary loop that mutates and selects code under task-specific fitness criteria.

Result: Rediscovered known optima in small 2D cases, set new best-known 2D benchmarks for N >= 40, matched most known 3D optima up to N <= 8, reported improved 3D benchmarks beyond, and achieved consistent reductions in rQMC mean-squared error for 32-dimensional option-pricing tasks compared to Joe-Kuo parameters.

Conclusion: LLM-driven evolutionary program synthesis can effectively automate the discovery of high-quality QMC constructions, recovering classical designs where optimal and improving them where finite-N structure matters.

Abstract: Low-discrepancy point sets and digital sequences underpin quasi-Monte Carlo (QMC) methods for high-dimensional integration. We cast two long-standing QMC design problems as program synthesis and solve them with an LLM-guided evolutionary loop that mutates and selects code under task-specific fitness: (i) constructing finite 2D/3D point sets with low star discrepancy, and (ii) choosing Sobol’ direction numbers that minimize randomized QMC error on downstream integrands. Our two-phase procedure combines constructive code proposals with iterative numerical refinement. On finite sets, we rediscover known optima in small 2D cases and set new best-known 2D benchmarks for N >= 40, while matching most known 3D optima up to the proven frontier (N <= 8) and reporting improved 3D benchmarks beyond. On digital sequences, evolving Sobol’ parameters yields consistent reductions in randomized quasi-Monte Carlo (rQMC) mean-squared error for several 32-dimensional option-pricing tasks relative to widely used Joe–Kuo parameters, while preserving extensibility to any sample size and compatibility with standard randomizations. Taken together, the results demonstrate that LLM-driven evolutionary program synthesis can automate the discovery of high-quality QMC constructions, recovering classical designs where they are optimal and improving them where finite-N structure matters. Data and code are available at https://github.com/hockeyguy123/openevolve-star-discrepancy.git.

[769] Optimising Battery Energy Storage System Trading via Energy Market Operator Price Forecast

Aymeric Fabre

Main category: cs.LG

TL;DR: This research develops a forecast-informed trading algorithm for battery energy storage systems (BESS) using AEMO price forecasts, benchmarking it against basic strategies and exploring machine learning enhancements.

DetailsMotivation: Grid volatility from renewables and market decentralization creates pressure for better trading strategies, but the practical value of abundant forecast data for real-world BESS trading decisions remains unexplored.

Method: Analyzes AEMO price forecast accuracy patterns based on time of day, forecast horizon, and regional variations to create a novel trading model, then benchmarks against basic algorithms and explores machine learning enhancements.

Result: Develops a forecast-driven BESS trading algorithm that optimizes arbitrage financial returns, with performance compared to basic trading strategies without forecast knowledge.

Conclusion: The research outcomes will inform future improvements in energy market trading models and promote more efficient BESS integration into market operations.

Abstract: In electricity markets around the world, the ability to anticipate price movements with precision can be the difference between profit and loss, especially for fast-acting assets like battery energy storage systems (BESS). As grid volatility increases due to renewables and market decentralisation, operators and forecasters alike face growing pressure to transform prediction into strategy. Yet while forecast data is abundant, especially in advanced markets like Australia’s National Electricity Market (NEM), its practical value in driving real-world BESS trading decisions remains largely unexplored. This thesis dives into that gap. This work addresses a key research question: Can the accuracy of the Australian Energy Market Operator (AEMO) energy price forecasts be systematically leveraged to develop a reliable and profitable battery energy storage system trading algorithm? Despite the availability of AEMO price forecasts, no existing framework evaluates their reliability or incorporates them into practical BESS trading strategies. By analysing patterns in forecast accuracy based on time of day, forecast horizon, and regional variations, this project creates a novel, forecast-informed BESS trading model to optimise arbitrage financial returns. The performance of this forecast-driven algorithm is benchmarked against a basic trading algorithm with no knowledge of forecast data. The study further explores the potential of machine learning techniques to predict future energy prices by enhancing AEMO forecasts to govern a more advanced trading strategy. The research outcomes will inform future improvements in energy market trading models and promote more efficient BESS integration into market operations.

[770] Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

Xu Wang, Yan Hu, Benyou Wang, Difan Zou

Main category: cs.LG

TL;DR: Higher interpretability in Sparse Autoencoders (SAEs) does not guarantee better steering utility for LLMs. A weak positive correlation exists, but feature selection using Delta Token Confidence significantly improves steering performance while eliminating the correlation.

DetailsMotivation: To investigate whether higher interpretability in SAEs actually leads to better steering utility for LLMs, challenging the common assumption that interpretable features naturally enable effective model behavior steering.

Method: Trained 90 SAEs across three LLMs with various architectures and sparsity levels, evaluated interpretability and steering utility using SAEBench and AxBench, performed rank-agreement analysis via Kendall’s tau b, and proposed Delta Token Confidence feature selection criterion.

Result: Found only weak positive association (tau b ≈ 0.298) between interpretability and steering utility. Delta Token Confidence improved steering performance by 52.52% compared to current best methods, and eliminated correlation between interpretability and utility (tau b ≈ 0).

Conclusion: Interpretability is an insufficient proxy for steering performance in SAEs. The most effective steering features show divergence between interpretability and utility, with Delta Token Confidence being a better selection criterion.

Abstract: Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective model behavior steering. Yet, a fundamental question remains unanswered: does higher interpretability indeed imply better steering utility? To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels, and evaluate their interpretability and steering utility based on SAEBench (arXiv:2501.12345) and AxBench (arXiv:2502.23456) respectively, and perform a rank-agreement analysis via Kendall’s rank coefficients (tau b). Our analysis reveals only a relatively weak positive association (tau b approx 0.298), indicating that interpretability is an insufficient proxy for steering performance. We conjecture the interpretability utility gap may stem from the selection of SAE features, as not all of them are equally effective for steering. To further find features that truly steer the behavior of LLMs, we propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next token distribution. We show that our method improves the steering performance of three LLMs by 52.52 percent compared to the current best output score based criterion (arXiv:2503.34567). Strikingly, after selecting features with high Delta Token Confidence, the correlation between interpretability and utility vanishes (tau b approx 0), and can even become negative. This further highlights the divergence between interpretability and utility for the most effective steering features.

[771] Operationalizing Data Minimization for Privacy-Preserving LLM Prompting

Jijie Zhou, Niloofar Mireshghallah, Tianshi Li

Main category: cs.LG

TL;DR: A framework for data minimization in LLMs that quantifies the least privacy-revealing disclosure while maintaining utility, with evaluation showing larger models tolerate stronger data minimization than smaller ones.

DetailsMotivation: Address privacy risks from users oversharing personal information with LLMs through memorization, personalization, and security breaches.

Method: Proposed framework with formal definitions for data minimization and priority-queue tree search to find optimal privacy-utility tradeoffs in privacy-ordered transformation space.

Result: Larger frontier LLMs (GPT-5) tolerate 85.7% redaction while smaller models (Qwen2.5-0.5B) only 19.3%; LLMs struggle to predict optimal minimization, showing bias toward abstraction and oversharing.

Conclusion: There’s both a privacy gap and capability gap - models lack awareness of what information they actually need to solve tasks, suggesting need for better minimization awareness.

Abstract: The rapid deployment of large language models (LLMs) in consumer applications has led to frequent exchanges of personal information. To obtain useful responses, users often share more than necessary, increasing privacy risks via memorization, context-based personalization, or security breaches. We present a framework to formally define and operationalize data minimization: for a given user prompt and response model, quantifying the least privacy-revealing disclosure that maintains utility, and we propose a priority-queue tree search to locate this optimal point within a privacy-ordered transformation space. We evaluated the framework on four datasets spanning open-ended conversations (ShareGPT, WildChat) and knowledge-intensive tasks with single-ground-truth answers (CaseHold, MedQA), quantifying achievable data minimization with nine LLMs as the response model. Our results demonstrate that larger frontier LLMs can tolerate stronger data minimization while maintaining task quality than smaller open-source models (85.7% redaction for GPT-5 vs. 19.3% for Qwen2.5-0.5B). By comparing with our search-derived benchmarks, we find that LLMs struggle to predict optimal data minimization directly, showing a bias toward abstraction that leads to oversharing. This suggests not just a privacy gap, but a capability gap: models may lack awareness of what information they actually need to solve a task.

[772] Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning

Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

Main category: cs.LG

TL;DR: THR is a token-level metric that quantifies token influence on correct responses in RL-tuned LLMs, revealing that positive THR tokens favor exploitation while negative THR tokens enable exploration. A THR-guided reweighting algorithm can bias training toward either exploration or exploitation.

DetailsMotivation: To address the open problem of explicitly controlling exploration vs exploitation in reinforcement learning for large language models, as current methods don't provide fine-grained control over this trade-off.

Method: Introduce Token Hidden Reward (THR) metric that measures each token’s influence on correct response likelihood under GRPO, then develop a THR-guided reweighting algorithm that modulates learning signals to bias training toward exploitation (amplifying positive THR) or exploration (amplifying negative THR).

Result: The algorithm improves greedy-decoding accuracy when favoring exploitation and Pass@K accuracy when favoring exploration. It integrates with other RL objectives like GSPO and generalizes across architectures including Llama.

Conclusion: THR provides a principled, fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, offering new tools for targeted fine-tuning in reasoning applications.

Abstract: Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token’s influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO’s learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.

[773] Towards Sampling Data Structures for Tensor Products in Turnstile Streams

Zhao Song, Shenghao Xie, Samson Zhou

Main category: cs.LG

TL;DR: The paper proposes an attention sampler using importance sampling to reduce computational burden in large-scale attention-based models, analyzing its theoretical efficiency and broad applicability.

DetailsMotivation: To address the computational challenges of large-scale attention-based models in AI, particularly inspired by classical ℓ₂ samplers and recent attention schemes in LLMs.

Method: Proposes an attention sampler definition using importance sampling methods in streaming setting, analyzing space and update time theoretically.

Result: Significantly reduces computational burden of traditional attention mechanisms while maintaining effectiveness.

Conclusion: The attention sampler framework is scalable and broadly applicable across various model architectures and domains, offering efficient computational solutions.

Abstract: This paper studies the computational challenges of large-scale attention-based models in artificial intelligence by utilizing importance sampling methods in the streaming setting. Inspired by the classical definition of the $\ell_2$ sampler and the recent progress of the attention scheme in Large Language Models (LLMs), we propose the definition of the attention sampler. Our approach significantly reduces the computational burden of traditional attention mechanisms. We analyze the effectiveness of the attention sampler from a theoretical perspective, including space and update time. Additionally, our framework exhibits scalability and broad applicability across various model architectures and domains.

[774] Group Policy Gradient

Junhua Chen, Zixi Zhang, Hantao Zhong, Rika Antonova

Main category: cs.LG

TL;DR: GPG is a critic-free policy gradient method that replaces learned value functions with group-based Monte Carlo advantage estimation, matching or outperforming PPO while being more computationally efficient.

DetailsMotivation: To eliminate the memory, compute, and hyperparameter costs associated with training critics in policy gradient methods like PPO, while maintaining performance.

Method: Uses group-based Monte Carlo advantage estimators instead of learned value functions, preserving PPO’s clipped-objective structure but removing critic training requirements.

Result: GPG matches or outperforms PPO on standard benchmarks, makes better use of parallel simulations, and achieves more efficient computational resource utilization.

Conclusion: GPG provides a viable critic-free alternative to PPO that maintains performance while reducing computational overhead and complexity.

Abstract: We introduce Group Policy Gradient (GPG), a family of critic-free policy-gradient estimators for general MDPs. Inspired by the success of GRPO’s approach in Reinforcement Learning from Human Feedback (RLHF), GPG replaces a learned value function with a group-based Monte Carlo advantage estimator, removing the memory, compute, and hyperparameter costs of training a critic while preserving PPO’s clipped-objective structure. We prove the consistency of the GPG estimator, analyze the bias-variance tradeoffs, and demonstrate empirically that GPG matches or outperforms PPO on standard benchmarks. GPG makes better use of parallel simulations, which, together with its critic-free design, results in more efficient use of computational resources than PPO.

[775] From Moments to Models: Graphon Mixture-Aware Mixup and Contrastive Learning

Ali Azizpour, Reza Ramezanpour, Ashutosh Sabharwal, Santiago Segarra

Main category: cs.LG

TL;DR: Proposes a unified framework for graph representation learning that explicitly models datasets as mixtures of underlying graph generative models (graphons), enabling model-aware clustering, data augmentation, and contrastive learning.

DetailsMotivation: Real-world graph datasets often contain mixtures of populations from different underlying distributions, but current graph learning methods like contrastive learning and Mixup typically ignore this mixture structure.

Method: Leverages graph moments (motif densities) to cluster graphs from the same generative model, enabling model-aware partitioning. Introduces graphon-mixture-aware mixup (GMAM) for data augmentation and model-adaptive contrastive learning (MGCL) with improved negative sampling.

Result: Achieves state-of-the-art performance: MGCL ranks top in unsupervised learning across 8 datasets, and GMAM outperforms existing methods in 6 out of 7 supervised learning datasets.

Conclusion: Explicitly modeling the mixture structure of graph datasets through graphon-based clustering significantly improves both unsupervised and supervised graph learning tasks by enabling more semantically valid augmentations and better contrastive learning objectives.

Abstract: Real-world graph datasets often consist of mixtures of populations, where graphs are generated from multiple distinct underlying distributions. However, modern representation learning approaches, such as graph contrastive learning (GCL) and augmentation methods like Mixup, typically overlook this mixture structure. In this work, we propose a unified framework that explicitly models data as a mixture of underlying probabilistic graph generative models represented by graphons. To characterize these graphons, we leverage graph moments (motif densities) to cluster graphs arising from the same model. This enables us to disentangle the mixture components and identify their distinct generative mechanisms. This model-aware partitioning benefits two key graph learning tasks: 1) It enables a graphon-mixture-aware mixup (GMAM), a data augmentation technique that interpolates in a semantically valid space guided by the estimated graphons, instead of assuming a single graphon per class. 2) For GCL, it enables model-adaptive and principled augmentations. Additionally, by introducing a new model-aware objective, our proposed approach (termed MGCL) improves negative sampling by restricting negatives to graphs from other models. We establish a key theoretical guarantee: a novel, tighter bound showing that graphs sampled from graphons with small cut distance will have similar motif densities with high probability. Extensive experiments on benchmark datasets demonstrate strong empirical performance. In unsupervised learning, MGCL achieves state-of-the-art results, obtaining the top average rank across eight datasets. In supervised learning, GMAM consistently outperforms existing strategies, achieving new state-of-the-art accuracy in 6 out of 7 datasets.

[776] REG: A Regularization Optimizer for Robust Training Dynamics

Zehua Liu, Han Wu, Xiaojin Fu, Shuqi Liu, Xiongwei Han, Tao Zhong, Mingxuan Yuan

Main category: cs.LG

TL;DR: REG optimizer replaces Muon’s matrix sign function with Row-and-Column-Scaling (RACS) operator for more stable LLM training while maintaining AdamW compatibility.

DetailsMotivation: Muon optimizer's reliance on matrix sign function causes training instability and incompatibility when fine-tuning models pre-trained with AdamW.

Method: Proposed REG optimizer uses Row-and-Column-Scaling (RACS) operator instead of Muon’s matrix sign function, providing less drastic regularization while maintaining theoretical grounding in matrix balancing.

Result: REG achieves superior performance and stability over AdamW, maintains consistency with AdamW training paradigm, and avoids performance degradation during fine-tuning that occurs with Muon.

Conclusion: REG optimizer provides a stable and compatible alternative to both AdamW and Muon, particularly effective for fine-tuning pre-trained models.

Abstract: Optimizers are crucial for the efficient training of Large Language Models (LLMs). While AdamW is the de facto standard, recent structure-aware optimizers like Muon have emerged, which regularize gradient updates by operating on entire weight matrices. The Muon optimizer balances the gradient updates along all the directions. However, Muon’s reliance on the matrix sign function can lead to training instability, exhibits incompatibility when fine-tuning models pre-trained with AdamW. To address these limitations, we propose \textbf{REG}, a novel optimizer that replaces Muon’s aggressive matrix sign operator with the Row-and-Column-Scaling (RACS) operator. Theoretically grounded in balancing a matrix, the RACS operator regularizes the update steps in a less drastic manner, making it simpler to implement and more compatible with established training dynamics. Through extensive empirical experiments on LLM training, we demonstrate that our REG optimizer not only achieves superior performance and stability over AdamW, but also maintains consistency with the AdamW training paradigm. This consistency is particularly evident during the fine-tuning stage, where REG optimizer avoids the performance degradation observed with Muon.

[777] Balancing Interpretability and Performance in Reinforcement Learning: An Adaptive Spectral Based Linear Approach

Qianxin Yi, Shao-Bo Lin, Jun Fan, Yao Wang

Main category: cs.LG

TL;DR: A spectral-based linear RL method that uses spectral filtering for interpretability while maintaining performance, with theoretical guarantees and empirical validation on real-world datasets.

DetailsMotivation: To bridge the gap between RL theory and practical decision making by designing an interpretability-oriented yet performance-enhanced RL approach, addressing the limitations of post-hoc explanations in current methods.

Method: Proposes a spectral-based linear RL method that extends ridge regression through a spectral filter function, with adaptive regularization parameter selection guided by bias-variance trade-off.

Result: The method achieves near-optimal bounds for parameter estimation and generalization error, and outperforms or matches existing baselines in decision quality on simulated environments and real-world datasets from Kuaishou and Taobao.

Conclusion: The approach successfully bridges RL theory and practical decision making, providing interpretability, accuracy, and adaptability in management contexts while enhancing user trust through interpretable decision-making processes.

Abstract: Reinforcement learning (RL) has been widely applied to sequential decision making, where interpretability and performance are both critical for practical adoption. Current approaches typically focus on performance and rely on post hoc explanations to account for interpretability. Different from these approaches, we focus on designing an interpretability-oriented yet performance-enhanced RL approach. Specifically, we propose a spectral based linear RL method that extends the ridge regression-based approach through a spectral filter function. The proposed method clarifies the role of regularization in controlling estimation error and further enables the design of an adaptive regularization parameter selection strategy guided by the bias-variance trade-off principle. Theoretical analysis establishes near-optimal bounds for both parameter estimation and generalization error. Extensive experiments on simulated environments and real-world datasets from Kuaishou and Taobao demonstrate that our method either outperforms or matches existing baselines in decision quality. We also conduct interpretability analyses to illustrate how the learned policies make decisions, thereby enhancing user trust. These results highlight the potential of our approach to bridge the gap between RL theory and practical decision making, providing interpretability, accuracy, and adaptability in management contexts.

[778] Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation

Yongfu Xue

Main category: cs.LG

TL;DR: IniLoRA improves LoRA by initializing low-rank matrices to approximate original model weights, achieving better performance across models and tasks.

DetailsMotivation: LoRA's zero-product initialization limits its ability to effectively activate and leverage original model weights, creating a performance bottleneck.

Method: Propose IniLoRA with novel initialization strategy that initializes low-rank matrices to closely approximate original model weights, plus two variants (IniLoRA-α and IniLoRA-β) with distinct initialization methods.

Result: Experimental results show IniLoRA achieves better performance than LoRA across a range of models and tasks.

Conclusion: IniLoRA addresses LoRA’s initialization limitation and provides enhanced parameter-efficient fine-tuning performance.

Abstract: The rapid development of parameter-efficient fine-tuning methods has noticeably improved the efficiency of adapting large language models. Among these, LoRA has gained widespread popularity due to its strong balance of effectiveness and parameter efficiency. However, LoRA relies on initializing two low-rank matrices whose product is zero, which limits its ability to effectively activate and leverage the original model weights-creating a potential bottleneck for optimal performance. To address this limitation, we propose \textbf{IniLoRA}, a novel initialization strategy that initializes the low-rank matrices to closely approximate the original model weights. Experimental results indicate that IniLoRA achieves better performance than LoRA across a range of models and tasks. Additionally, we introduce two variants, IniLoRA-$\alpha$ and IniLoRA-$\beta$, both leveraging distinct initialization methods to enhance performance further.

[779] Personalized federated prototype learning in mixed heterogeneous data scenarios

Jiahao Zeng, Wolong Xing, Liangtao Shi, Xin Huang, Jialin Wang, Zhile Cao, Zhenkui Shi

Main category: cs.LG

TL;DR: PFPL is a federated learning approach that addresses data heterogeneity by creating personalized unbiased prototypes for each client and using consistent regularization to improve model convergence while reducing communication costs.

DetailsMotivation: Conventional federated learning approaches often focus on isolated heterogeneous scenarios, leading to skewed feature or label distributions. However, data heterogeneity can actually improve model performance if properly leveraged.

Method: The PFPL method constructs personalized, unbiased prototypes for each client to provide richer domain knowledge and unbiased convergence targets. It introduces consistent regularization during local updates to align local instances with their personalized prototypes.

Result: Experimental results on Digits and Office Caltech datasets validate the effectiveness of PFPL, showing improved model performance and successfully reduced communication costs.

Conclusion: PFPL effectively addresses data heterogeneity in federated learning by leveraging personalized prototypes and consistent regularization, achieving better model performance while reducing communication overhead.

Abstract: Federated learning has received significant attention for its ability to simultaneously protect customer privacy and leverage distributed data from multiple devices for model training. However, conventional approaches often focus on isolated heterogeneous scenarios, resulting in skewed feature distributions or label distributions. Meanwhile, data heterogeneity is actually a key factor in improving model performance. To address this issue, we propose a new approach called PFPL in mixed heterogeneous scenarios. The method provides richer domain knowledge and unbiased convergence targets by constructing personalized, unbiased prototypes for each client. Moreover, in the local update phase, we introduce consistent regularization to align local instances with their personalized prototypes, which significantly improves the convergence of the loss function. Experimental results on Digits and Office Caltech datasets validate the effectiveness of our approach and successfully reduce the communication cost.

[780] Cost Efficient Fairness Audit Under Partial Feedback

Nirjhar Das, Mohit Sharma, Praharsh Nanavati, Kirankumar Shiragur, Amit Deshpande

Main category: cs.LG

TL;DR: The paper proposes novel cost-effective fairness audit algorithms for classifiers under partial feedback, where only positively classified individuals have observed labels. It addresses both black-box and mixture model settings, achieving significant cost reductions compared to baselines.

DetailsMotivation: Real-world fairness auditing faces challenges with partial feedback (only positive classifications have observed labels) and high costs for acquiring additional labeled data, such as in credit assessment and loan processing scenarios.

Method: Developed two approaches: (1) near-optimal black-box auditing algorithm under mild assumptions, and (2) novel mixture model algorithm that leverages learning from truncated samples and maximum-a-posteriori oracles, extending spherical Gaussian mixtures to exponential family mixtures.

Result: The algorithms significantly outperform natural baselines by approximately 50% in audit cost on real-world datasets (Adult Income and Law School), and work with popular fairness metrics including demographic parity, equal opportunity, and equalized odds.

Conclusion: The proposed auditing algorithms provide cost-effective solutions for fairness assessment under partial feedback, with the mixture model approach achieving particularly strong performance by leveraging distributional assumptions.

Abstract: We study the problem of auditing the fairness of a given classifier under partial feedback, where true labels are available only for positively classified individuals, (e.g., loan repayment outcomes are observed only for approved applicants). We introduce a novel cost model for acquiring additional labeled data, designed to more accurately reflect real-world costs such as credit assessment, loan processing, and potential defaults. Our goal is to find optimal fairness audit algorithms that are more cost-effective than random exploration and natural baselines. In our work, we consider two audit settings: a black-box model with no assumptions on the data distribution, and a mixture model, where features and true labels follow a mixture of exponential family distributions. In the black-box setting, we propose a near-optimal auditing algorithm under mild assumptions and show that a natural baseline can be strictly suboptimal. In the mixture model setting, we design a novel algorithm that achieves significantly lower audit cost than the black-box case. Our approach leverages prior work on learning from truncated samples and maximum-a-posteriori oracles, and extends known results on spherical Gaussian mixtures to handle exponential family mixtures, which may be of independent interest. Moreover, our algorithms apply to popular fairness metrics including demographic parity, equal opportunity, and equalized odds. Empirically, we demonstrate strong performance of our algorithms on real-world fair classification datasets like Adult Income and Law School, consistently outperforming natural baselines by around 50% in terms of audit cost.

[781] Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration

Wenhao Deng, Long Wei, Chenglei Yu, Tailin Wu

Main category: cs.LG

TL;DR: RLVR improves LLM reasoning but suffers from diminishing returns with increased sampling due to reverse KL divergence limiting exploration. RAPO uses forward KL for out-of-distribution exploration and adaptive in-distribution exploration to overcome this limitation.

DetailsMotivation: Address the fundamental limitation of RLVR where performance gains diminish with increased sampling budget due to restricted exploration caused by reverse KL divergence regularizer.

Method: Propose RAPO algorithm that uses forward KL penalty for out-of-distribution exploration and reweights reference policy for adaptive in-distribution exploration, trained on SimpleRL-Zero dataset without supervised fine-tuning.

Result: RAPO-trained Qwen2.5-3B and 7B models show consistent performance improvements on AIME2024 and AIME2025 benchmarks, surpassing base model performance ceiling and solving previously intractable problems.

Conclusion: RAPO advances RLVR frontier for challenging reasoning tasks by enabling broader yet focused exploration that overcomes the limitations of reverse KL divergence.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model’s restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model’s support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model’s performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.

[782] HydroFusion-LMF: Semi-Supervised Multi-Network Fusion with Large-Model Adaptation for Long-Term Daily Runoff Forecasting

Qianfei Fan, Jiayu Wei, Peijun Zhu, Wensheng Ye, Meie Fang

Main category: cs.LG

TL;DR: HydroFusion-LMF is a unified framework for daily runoff forecasting that combines learnable decomposition, multiple expert models, and hydrologic context-aware fusion to handle non-stationarity and improve accuracy.

DetailsMotivation: Accurate decade-scale daily runoff forecasting is challenging due to blending trends, seasonal cycles, regime shifts, and sparse extremes. Existing deep models target single facets and under-utilize unlabeled data, limiting regime adaptivity.

Method: Proposes a four-component framework: (1) learnable trend-seasonal-residual decomposition, (2) routing residuals through heterogeneous experts, (3) hydrologic context-aware fusion gate, and (4) semi-supervised multi-task objective with optional adapter layers.

Result: On a ~10-year daily dataset, achieves MSE 1.0128 and MAE 0.5818, improving the strongest baseline (DLinear) by 10.2%/10.3% and mean baseline by 24.6%/17.1%, with simultaneous MSE and MAE reductions.

Conclusion: The framework balances interpretability with performance, advancing label-efficient hydrologic forecasting under non-stationarity through explicit components and sparse gating.

Abstract: Accurate decade-scale daily runoff forecasting in small watersheds is difficult because signals blend drifting trends, multi-scale seasonal cycles, regime shifts, and sparse extremes. Prior deep models (DLinear, TimesNet, PatchTST, TiDE, Nonstationary Transformer, LSTNet, LSTM) usually target single facets and under-utilize unlabeled spans, limiting regime adaptivity. We propose HydroFusion-LMF, a unified framework that (i) performs a learnable trend-seasonal-residual decomposition to reduce non-stationarity, (ii) routes residuals through a compact heterogeneous expert set (linear refinement, frequency kernel, patch Transformer, recurrent memory, dynamically normalized attention), (iii) fuses expert outputs via a hydrologic context-aware gate conditioned on day-of-year phase, antecedent precipitation, local variance, flood indicators, and static basin attributes, and (iv) augments supervision with a semi-supervised multi-task objective (composite MSE/MAE + extreme emphasis + NSE/KGE, masked reconstruction, multi-scale contrastive alignment, augmentation consistency, variance-filtered pseudo-labeling). Optional adapter / LoRA layers inject a frozen foundation time-series encoder efficiently. On a ~10-year daily dataset HydroFusion-LMF attains MSE 1.0128 / MAE 0.5818, improving the strongest baseline (DLinear) by 10.2% / 10.3% and the mean baseline by 24.6% / 17.1%. We observe simultaneous MSE and MAE reductions relative to baselines. The framework balances interpretability (explicit components, sparse gating) with performance, advancing label-efficient hydrologic forecasting under non-stationarity.

[783] LLM Chemistry Estimation for Multi-LLM Recommendation

Huascar Sanchez, Briland Hitaj

Main category: cs.LG

TL;DR: LLM Chemistry is a framework that measures synergistic or antagonistic behaviors in multi-LLM collaborations, quantifying interaction dependencies to recommend optimal model ensembles.

DetailsMotivation: Existing multi-LLM approaches rely on implicit selection without analyzing whether collaborating models truly complement or conflict, lacking systematic measurement of collective performance beyond individual capabilities.

Method: The authors formalize LLM chemistry concept, propose algorithms to quantify interaction dependencies, and develop ensemble recommendation methods based on theoretical analysis of heterogeneous model profiles.

Result: Evaluation on classification, summarization, and program repair tasks shows task-dependent effects, with chemistry most evident under heterogeneous model profiles and influenced by task type, group size, and complexity.

Conclusion: LLM Chemistry serves as both a diagnostic factor in multi-LLM systems and a foundation for ensemble recommendation, establishing a systematic approach to understanding and optimizing model collaborations.

Abstract: Multi-LLM collaboration promises accurate, robust, and context-aware solutions, yet existing approaches rely on implicit selection and output assessment without analyzing whether collaborating models truly complement or conflict. We introduce LLM Chemistry – a framework that measures when LLM combinations exhibit synergistic or antagonistic behaviors that shape collective performance beyond individual capabilities. We formalize the notion of chemistry among LLMs, propose algorithms that quantify it by analyzing interaction dependencies, and recommend optimal model ensembles accordingly. Our theoretical analysis shows that chemistry among collaborating LLMs is most evident under heterogeneous model profiles, with its outcome impact shaped by task type, group size, and complexity. Evaluation on classification, summarization, and program repair tasks provides initial evidence for these task-dependent effects, thereby reinforcing our theoretical results. This establishes LLM Chemistry as both a diagnostic factor in multi-LLM systems and a foundation for ensemble recommendation.

[784] Neural Low-Discrepancy Sequences

Michael Etienne Van Huffel, Nathan Kirk, Makram Chahine, Daniela Rus, T. Konstantin Rusch

Main category: cs.LG

TL;DR: NeuroLDS is a machine learning framework that generates low-discrepancy sequences (LDS) by training neural networks to map indices to points, achieving lower discrepancy than classical methods across applications like numerical integration and robot motion planning.

DetailsMotivation: Traditional low-discrepancy constructions rely on abstract algebra and number theory, and existing machine learning approaches like MPMC can only generate point sets but not sequences where every prefix has low discrepancy, which is essential for many applications.

Method: A two-stage learning process: supervised approximation of classical LDS constructions followed by unsupervised fine-tuning to minimize prefix discrepancies using neural networks that map indices to points.

Result: NeuroLDS significantly outperforms all previous LDS constructions in discrepancy measures and demonstrates effectiveness in numerical integration, robot motion planning, and scientific machine learning applications.

Conclusion: The framework shows promise for broad applications and represents the first successful machine learning-based approach for generating low-discrepancy sequences.

Abstract: Low-discrepancy points are designed to efficiently fill the space in a uniform manner. This uniformity is highly advantageous in many problems in science and engineering, including in numerical integration, computer vision, machine perception, computer graphics, machine learning, and simulation. Whereas most previous low-discrepancy constructions rely on abstract algebra and number theory, Message-Passing Monte Carlo (MPMC) was recently introduced to exploit machine learning methods for generating point sets with lower discrepancy than previously possible. However, MPMC is limited to generating point sets and cannot be extended to low-discrepancy sequences (LDS), i.e., sequences of points in which every prefix has low discrepancy, a property essential for many applications. To address this limitation, we introduce Neural Low-Discrepancy Sequences ($NeuroLDS$), the first machine learning-based framework for generating LDS. Drawing inspiration from classical LDS, we train a neural network to map indices to points such that the resulting sequences exhibit minimal discrepancy across all prefixes. To this end, we deploy a two-stage learning process: supervised approximation of classical constructions followed by unsupervised fine-tuning to minimize prefix discrepancies. We demonstrate that $NeuroLDS$ outperforms all previous LDS constructions by a significant margin with respect to discrepancy measures. Moreover, we demonstrate the effectiveness of $NeuroLDS$ across diverse applications, including numerical integration, robot motion planning, and scientific machine learning. These results highlight the promise and broad significance of Neural Low-Discrepancy Sequences. Our code can be found at https://github.com/camail-official/neuro-lds.

[785] Principled and Tractable RL for Reasoning with Diffusion Language Models

Anthony Zhan

Main category: cs.LG

TL;DR: AGRPO is a principled RL algorithm for diffusion LLMs that achieves significant performance gains on math/reasoning tasks over baseline models and comparable RL methods.

DetailsMotivation: Diffusion LLMs have not benefited from modern post-training techniques like RL due to incompatibility with traditional algorithms and lack of theoretical grounding in existing approaches.

Method: Amortized Group Relative Policy Optimization (AGRPO) - a principled on-policy RL algorithm using Monte Carlo sampling to compute unbiased policy gradient estimates specifically designed for diffusion LLMs.

Result: Achieved up to +7.6% absolute gain on GSM8K, 3.8x performance on Countdown task over baseline, and 1.3x gains over comparable RL methods like diffu-GRPO, with persistent gains across different sampling steps.

Conclusion: Online RL algorithms can be successfully extended to diffusion LLMs in principled ways while maintaining theoretical soundness and practical effectiveness.

Abstract: Diffusion large language models (dLLMs) are a new paradigm of non-autoregressive language models that are trained to predict multiple tokens in parallel and generate text via iterative unmasking. Recent works have successfully pretrained dLLMs to parity with autoregressive LLMs at the 8B scale, but dLLMs have yet to benefit from modern post-training techniques, e.g. reinforcement learning (RL), that have proven effective for autoregressive models. Crucially, algorithms designed for traditional LLMs aren’t directly compatible with diffusion frameworks due to inherent differences in modeling assumptions. Moreover, existing attempts at dLLM post-training with RL rely on heuristic-based objectives with no theoretical grounding. In this work, we present Amortized Group Relative Policy Optimization (AGRPO), a principled on-policy RL algorithm designed specifically for dLLMs. AGRPO uses Monte Carlo sampling to compute an unbiased policy gradient estimate, making it the first tractable, faithful adaptation of policy gradient methods for dLLMs. We demonstrate AGRPO’s effectiveness on different math/reasoning tasks, a common setting for RL with LLMs, achieving up to +7.6% absolute gain on GSM8K and 3.8x performance on the Countdown task over the baseline LLaDA-8B-Instruct model and 1.3x performance gains over comparable RL methods such as diffu-GRPO. Furthermore, these gains persist across different numbers of sampling steps at inference time, achieving better tradeoffs between compute and performance. Our results demonstrate that online RL algorithms can be extended to diffusion LLMs in principled ways, maintaining both theoretical soundness and practical effectiveness.

[786] EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models

Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, Qingfu Zhang

Main category: cs.LG

TL;DR: EvoEngineer is a systematic LLM-based framework for CUDA kernel optimization that achieves 2.72× average speedup with 69.8% code validity rate, outperforming existing methods.

DetailsMotivation: CUDA kernel optimization is critical for AI performance but suffers from fragmented approaches and unclear problem formulations. General-purpose LLM methods fail to meet strict correctness requirements.

Method: Formalized CUDA kernel optimization as a code optimization task, then developed EvoEngineer - a systematic LLM-based code evolution framework that balances performance and correctness.

Result: Achieved 2.72× average median speedup over baseline CUDA kernels with 69.8% code validity rate. Maximum speedup of 36.75× over PyTorch kernels, with highest speedup on 56% of operations achieving over 2× acceleration.

Conclusion: EvoEngineer provides a principled balance between performance and correctness in CUDA kernel optimization, demonstrating superior results compared to existing methods across 91 real-world kernels.

Abstract: CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promise of Large Language Models (LLMs) for automating kernel optimization, this field suffers from a fragmented ecosystem of isolated and incomparable approaches with unclear problem formulations. Furthermore, general-purpose LLM code evolution methods cannot meet strict correctness requirements of CUDA kernel optimization. We address these fundamental challenges by first formalizing CUDA kernel optimization as a code optimization task with a clear objective, constraints, and evaluation metrics. We then establish the first systematic LLM-based code evolution framework, EvoEngineer, that provides guidance for designing and adapting optimization strategies to achieve a balance between performance and correctness. Finally, we implement a kernel optimization system based on this framework and conduct extensive experiments on 91 real-world CUDA kernels. Our results demonstrate that EvoEngineer achieves a principled balance between performance and correctness, with the highest averaged median speedup of \textbf{2.72}$\times$ over baseline CUDA kernels and a code validity rate of \textbf{69.8}%, outperforming existing methods on both dimensions. Our method achieves a maximum speedup of \textbf{36.75}$\times$ among all operations over PyTorch kernels and delivers the highest speedup on \textbf{28} (\textbf{56.0%}) of 50 operations that achieve over \textbf{2$\times$} acceleration.

[787] What Scales in Cross-Entropy Scaling Law?

Junxi Yan, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu

Main category: cs.LG

TL;DR: The cross-entropy scaling law breaks down at large scales. The paper decomposes cross-entropy into error-entropy, self-alignment, and confidence, finding only error-entropy follows robust power-law scaling.

DetailsMotivation: Recent evidence shows the cross-entropy scaling law fails at very large model scales, causing problems for LLM development. The authors hypothesize cross-entropy itself doesn't truly scale - only one hidden component does.

Method: Introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. Conduct extensive experiments on multiple datasets with 32 models spanning five orders of magnitude in size.

Result: Only error-entropy follows robust power-law scaling, while self-alignment and confidence remain largely invariant. Error-entropy dominates cross-entropy in small models but diminishes proportionally as models grow larger.

Conclusion: The error-entropy scaling law provides a more accurate description of model behavior than cross-entropy scaling law, explaining why the latter appears accurate at small scales but fails at large ones.

Abstract: The cross-entropy scaling law has long served as a key tool for guiding the development of large language models. It shows that cross-entropy loss decreases in a predictable power-law rate as the model size increases. However, recent evidence indicates that this law breaks down at very large scales: the loss decreases more slowly than expected, which causes significant trouble for developing large language models. In this paper, we hypothesize that the root cause lies in the fact that cross-entropy itself does not truly scale; instead, only one of its hidden components does. To investigate this, we introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. We show both theoretically and empirically that this decomposition precisely captures the training dynamics and optimization objectives. Through extensive experiments on multiple datasets and 32 models spanning five orders of magnitude in size, we find that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant. Moreover, error-entropy constitutes the dominant share of cross-entropy in small models but diminishes in proportion as models grow larger. This explains why the cross-entropy scaling law appears accurate at small scales but fails at very large ones. Our findings establish the error-entropy scaling law as a more accurate description of model behavior. We believe it will have wide applications in the training, understanding, and future development of large language models.

[788] Merge and Guide: Unifying Model Merging and Guided Decoding for Controllable Multi-Objective Generation

Guofu Xie, Chen Zhang, Xiao Zhang, Yunsheng Shi, Ting Yao, Jun Xu

Main category: cs.LG

TL;DR: MAGE is a two-stage framework that combines model merging with guided decoding to improve controllable multi-objective generation, addressing compatibility issues and outperforming existing methods.

DetailsMotivation: Existing methods for controllable multi-objective generation are insufficient - merging-based approaches provide indirect control while decoding-based guidance requires aggregating multiple expert models with high space overhead and dependency on individual model capacity.

Method: Two-stage framework: Stage 1 dynamically constructs a robust base model by merging backbone models for multiple objectives; Stage 2 merges explicit and implicit value models into a unified guidance proxy to steer the base model’s decoding.

Result: Extensive experiments show superior controllability, Pareto-optimal performance, and enhanced adaptability compared to existing approaches. Validates Linear Mode Connectivity in value models and explores relationship between model merging and prediction ensembling.

Conclusion: MAGE framework effectively addresses compatibility problems in multi-objective generation, demonstrating improved performance through the combination of model merging and guided decoding.

Abstract: Adapting to diverse user needs at test time is a key challenge in controllable multi-objective generation. Existing methods are insufficient: merging-based approaches provide indirect, suboptimal control at the parameter level, often disregarding the impacts of multiple objectives. While decoding-based guidance is more direct, it typically requires aggregating logits from multiple expert models, incurring significant space overhead and relying heavily on individual model capacity. To address these issues, we introduce Merge-And-GuidE (MAGE), a two-stage framework that leverages model merging for guided decoding. We first identify a critical compatibility problem between the guidance and base models. In Stage 1, MAGE resolves this by dynamically constructing a more robust base model, merging a series of backbone models that account for multiple objectives. In Stage 2, we merge explicit and implicit value models into a unified guidance proxy, which then steers the decoding of the base model from Stage 1. Our analysis empirically validates Linear Mode Connectivity (LMC) in value models, explores the relationship between model merging and prediction ensembling, and demonstrates the enhanced controllability afforded by our approach. Extensive experiments show that our method outperforms existing approaches, achieving superior controllability, Pareto-optimal performance, and enhanced adaptability.

[789] Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Ziyan Wang, Zheng Wang, Jie Fu, Xingwei Qu, Qi Cheng, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Main category: cs.LG

TL;DR: SFPO is a reinforcement learning framework that improves reasoning in LLMs by using slow-fast policy optimization with reposition mechanisms to stabilize training and reduce rollouts.

DetailsMotivation: On-policy RL algorithms like GRPO suffer from noisy gradients and unstable updates during early training due to low-quality rollouts, leading to inefficient exploration.

Method: SFPO decomposes each training step into three stages: fast trajectory of inner steps, reposition mechanism to control off-policy drift, and slow correction, while preserving the original objective and rollout process.

Result: SFPO outperforms GRPO by up to 2.80 points on math reasoning benchmarks, achieves 4.93x fewer rollouts, and 4.19x reduction in wall-clock time to match GRPO’s best accuracy.

Conclusion: SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training while being plug-compatible with existing policy-gradient pipelines.

Abstract: Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points in average on math reasoning benchmarks. It also achieves up to 4.93\texttimes{} fewer rollouts and a 4.19\texttimes{} reduction in wall-clock time to match GRPO’s best accuracy.

[790] Allocation of Parameters in Transformers

Ruoxi Yu, Haotian Jiang, Jingpu Cheng, Penghao Yu, Qianxiao Li, Zhong Li

Main category: cs.LG

TL;DR: This paper analyzes how to optimally allocate attention heads and dimensions across Transformer layers to balance expressivity and efficiency, revealing saturation behavior in softmax activations and proposing principled allocation strategies.

DetailsMotivation: Transformers have achieved remarkable success but their theoretical foundation for model efficiency remains underexplored, particularly regarding optimal parameter allocation across layers.

Method: Mathematical analysis of early layers’ role in information extraction, theoretical characterization of head vs dimension trade-off under fixed parameter budget, and investigation of softmax activation saturation behavior.

Result: Discovered and proved saturation behavior where increasing head dimensions leads to diminishing returns, especially for long sequences, suggesting later layers can operate efficiently with reduced parameters.

Conclusion: Proposed principled strategies for allocating attention heads and dimensions across Transformer layers, providing theoretically-grounded insights for model efficiency in Transformer architectures.

Abstract: Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters – mainly attention heads and head dimensions – should be allocated across layers to balance expressivity and efficiency. We first provide mathematical analysis on the role of early layers in information extraction from an approximation perspective, with a theoretical characterization on the trade-off between the number of heads and head dimension under a fixed parameter budget. In addition, we uncover and prove the \emph{saturation} behavior of softmax activations: Continuously increasing head dimensions can lead to diminishing returns in learning errors, particularly for long sequences. Supported by both theory and experiments, this saturation pattern suggests that later layers can operate more efficiently with reduced parameters. Combining these insights, we propose principled strategies for allocating attention heads and dimensions across Transformers’ layers, shedding light on theoretically-grounded model efficiency of Transformer-based architectures.

[791] Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models

Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Main category: cs.LG

TL;DR: This paper compares the performance characteristics of Autoregressive Language Models (ARMs) and Diffusion Language Models (DLMs), finding that while DLMs offer higher arithmetic intensity through parallel generation, they struggle with long contexts and batch inference compared to ARMs.

DetailsMotivation: To understand the performance trade-offs between ARMs (sequential token generation) and DLMs (parallel text generation) in large language models, as DLMs' performance implications relative to commonly deployed ARMs are not fully understood.

Method: Comprehensive performance study using both theoretical analysis and profiling data to characterize trade-offs between ARMs and DLMs, including analysis of DLMs with block-wise decoding and batched inference scenarios.

Result: DLMs exhibit higher arithmetic intensity than ARMs due to parallel sequence processing but fail to scale effectively to longer contexts. Block-wise decoding helps DLMs achieve increased arithmetic intensity while maintaining long-context scaling. ARMs show superior throughput in batched inference due to better parallelism across sequences.

Conclusion: DLMs show promise for parallel generation but require optimizations like reducing sampling steps and implementing block-wise decoding to compete with ARMs on latency and long-context performance, with ARMs maintaining advantages in batched inference scenarios.

Abstract: Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks, including document processing and coding. Autoregressive Language Models (ARMs), which generate tokens sequentially conditioned on all previous tokens, have been the predominant paradigm for LLMs. However, while these networks have achieved high accuracy across a range of downstream tasks, they exhibit low arithmetic intensity due to the inherent sequential dependency with next-token prediction. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture. DLMs generate output text in parallel, breaking the limitations of sequential dependency. However, the performance implications of DLMs relative to commonly deployed ARMs are not fully understood. In this work, we present a comprehensive performance study analyzing the performance characteristics of ARMs and DLMs, using both theoretical analysis and profiling data to characterize the trade-offs between these approaches. We illustrate that although DLMs exhibit higher arithmetic intensity compared to ARMs because of their capability to utilize parallelism across sequence lengths, they fail to scale effectively to longer contexts. We then explore DLMs with block-wise decoding, outlining how this approach allows for increased arithmetic intensity, while still scaling well to long contexts (similar to ARMs). We also show interesting trade-offs for batched inference, where we find that ARMs exhibit superior throughput, as they benefit more from parallelism across sequences in the batch. Finally, we highlight opportunities for accelerating DLM inference, and, in particular, highlight the importance of reducing the number of sampling steps for allowing open-source DLMs to provide improved latency relative to ARMs.

[792] Robust Batched Bandits

Yunwen Guo, Yunlun Shu, Gongyi Zhuo, Tianyu Wang

Main category: cs.LG

TL;DR: Robust batched multi-armed bandit algorithms for heavy-tailed rewards in finite-arm and Lipschitz settings, revealing that heavier tails can reduce batch requirements in instance-independent regimes but not in instance-dependent settings.

DetailsMotivation: Real-world applications like clinical trials often have heavy-tailed reward distributions, but existing batched MAB research assumes light-tailed distributions, creating a gap between theory and practice.

Method: Proposed robust batched bandit algorithms specifically designed for heavy-tailed rewards in both finite-arm and Lipschitz-continuous settings.

Result: Discovered that in instance-independent and Lipschitz settings, heavier-tailed rewards require fewer batches for near-optimal regret, while in instance-dependent settings, batch requirements remain unchanged regardless of tail heaviness.

Conclusion: Tail heaviness affects batch complexity differently across settings - reducing batch needs in some cases while having no effect in others, providing important insights for practical bandit algorithm design.

Abstract: The batched multi-armed bandit (MAB) problem, in which rewards are collected in batches, is crucial for applications such as clinical trials. Existing research predominantly assumes light-tailed reward distributions, yet many real-world scenarios, including clinical outcomes, exhibit heavy-tailed characteristics. This paper bridges this gap by proposing robust batched bandit algorithms designed for heavy-tailed rewards, within both finite-arm and Lipschitz-continuous settings. We reveal a surprising phenomenon: in the instance-independent regime, as well as in the Lipschitz setting, heavier-tailed rewards necessitate a smaller number of batches to achieve near-optimal regret. In stark contrast, for the instance-dependent setting, the required number of batches to attain near-optimal regret remains invariant with respect to tail heaviness.

[793] Wave-PDE Nets: Trainable Wave-Equation Layers as an Alternative to Attention

Harshil Vejendla

Main category: cs.LG

TL;DR: Wave-PDE Nets use differentiable wave equation simulation as neural network layers, achieving Transformer-level performance with better efficiency (30% faster, 25% less memory) through spectral solvers.

DetailsMotivation: To provide an alternative to attention and first-order state-space models using oscillatory, global mechanisms based on physical wave propagation principles.

Method: Each layer propagates hidden states as continuous fields through trainable velocity and damping parameters, using symplectic spectral FFT-based solvers for O(n log n) efficiency.

Result: Matches or exceeds Transformer performance on language and vision benchmarks while reducing wall-clock time by 30% and peak memory by 25%.

Conclusion: Wave-PDE Nets offer computationally efficient and robust architecture with strong physical inductive bias, proven to be universal approximators.

Abstract: We introduce Wave-PDE Nets, a neural architecture whose elementary operation is a differentiable simulation of the second-order wave equation. Each layer propagates its hidden state as a continuous field through a medium with trainable spatial velocity c(x) and damping {\gamma}(x). A symplectic spectral solver based on FFTs realises this propagation in O(nlog n) time. This oscillatory, global mechanism provides a powerful alternative to attention and first-order state-space models. We prove that a single Wave-PDE layer is a universal approximator. On language and vision benchmarks, Wave-PDE Nets match or exceed Transformer performance while demonstrating superior practical efficiency, reducing wall-clock time by up to 30% and peak memory by 25%. Ablation studies confirm the critical role of symplectic integration and a spectral Laplacian for stability and performance. Visualizations of the learned physical parameters reveal that the model learns intuitive strategies for information propagation. These results position Wave-PDE Nets as a computationally efficient and robust architecture with a strong physical inductive bias.

[794] Curriculum-Augmented GFlowNets For mRNA Sequence Generation

Aya Laajil, Abduragim Shtanchaev, Sajan Muhammad, Eric Moulines, Salem Lahlou

Main category: cs.LG

TL;DR: CAGFN integrates curriculum learning with multi-objective GFlowNets to generate mRNA sequences, improving Pareto performance and biological plausibility while maintaining diversity.

DetailsMotivation: mRNA sequence design is challenging due to vast nucleotide combinations and multi-objective optimization requirements. Current GFlowNet approaches suffer from sparse rewards and long-horizon training difficulties.

Method: Curriculum-Augmented GFlowNets (CAGFN) with length-based curriculum that progressively adapts maximum sequence length, plus a new mRNA design environment for training models to generate plausible mRNA candidates.

Result: CAGFN improves Pareto performance and biological plausibility, reaches higher-quality solutions faster than GFlowNets without curriculum, and enables generalization to out-of-distribution sequences.

Conclusion: CAGFN provides an effective approach for mRNA therapeutic sequence design by combining curriculum learning with multi-objective GFlowNets, advancing the field of therapeutic sequence generation.

Abstract: Designing mRNA sequences is a major challenge in developing next-generation therapeutics, since it involves exploring a vast space of possible nucleotide combinations while optimizing sequence properties like stability, translation efficiency, and protein expression. While Generative Flow Networks are promising for this task, their training is hindered by sparse, long-horizon rewards and multi-objective trade-offs. We propose Curriculum-Augmented GFlowNets (CAGFN), which integrate curriculum learning with multi-objective GFlowNets to generate de novo mRNA sequences. CAGFN integrates a length-based curriculum that progressively adapts the maximum sequence length guiding exploration from easier to harder subproblems. We also provide a new mRNA design environment for GFlowNets which, given a target protein sequence and a combination of biological objectives, allows for the training of models that generate plausible mRNA candidates. This provides a biologically motivated setting for applying and advancing GFlowNets in therapeutic sequence design. On different mRNA design tasks, CAGFN improves Pareto performance and biological plausibility, while maintaining diversity. Moreover, CAGFN reaches higher-quality solutions faster than a GFlowNet trained with random sequence sampling (no curriculum), and enables generalization to out-of-distribution sequences.

[795] Detecting Invariant Manifolds in ReLU-Based RNNs

Lukas Eisenmann, Alena Brändle, Zahra Monfared, Daniel Durstewitz

Main category: cs.LG

TL;DR: A novel algorithm for detecting stable and unstable manifolds in piecewise-linear RNNs (PLRNNs) is introduced, enabling characterization of multistability and chaos, with applications to understanding neural dynamics.

DetailsMotivation: Understanding why and how trained RNNs produce their behavior is important for scientific and medical applications, and explainable AI more generally. The dynamical repertoire depends on topological and geometrical properties of state space, particularly stable and unstable manifolds.

Method: A novel algorithm for detecting stable and unstable manifolds in piecewise-linear RNNs (PLRNNs) using ReLU activation functions. The method traces boundaries between basins of attraction and finds homoclinic points.

Result: The algorithm successfully characterizes multistability by tracing basin boundaries, establishes existence of chaos through homoclinic points, and provides insights into neural dynamics from electrophysiological recordings.

Conclusion: The introduced manifold detection algorithm enables systematic analysis of RNN dynamics, providing tools for understanding multistability and chaos in neural networks, with practical applications in neuroscience and explainable AI.

Abstract: Recurrent Neural Networks (RNNs) have found widespread applications in machine learning for time series prediction and dynamical systems reconstruction, and experienced a recent renaissance with improved training algorithms and architectural designs. Understanding why and how trained RNNs produce their behavior is important for scientific and medical applications, and explainable AI more generally. An RNN’s dynamical repertoire depends on the topological and geometrical properties of its state space. Stable and unstable manifolds of periodic points play a particularly important role: They dissect a dynamical system’s state space into different basins of attraction, and their intersections lead to chaotic dynamics with fractal geometry. Here we introduce a novel algorithm for detecting these manifolds, with a focus on piecewise-linear RNNs (PLRNNs) employing rectified linear units (ReLUs) as their activation function. We demonstrate how the algorithm can be used to trace the boundaries between different basins of attraction, and hence to characterize multistability, a computationally important property. We further show its utility in finding so-called homoclinic points, the intersections between stable and unstable manifolds, and thus establish the existence of chaos in PLRNNs. Finally we show for an empirical example, electrophysiological recordings from a cortical neuron, how insights into the underlying dynamics could be gained through our method.

[796] Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions

Wenyuan Zhao, Adithya Balachandran, Chao Tian, Paul Pu Liang

Main category: cs.LG

TL;DR: A new efficient Gaussian Partial Information Decomposition (GPID) method using gradient-based optimization and information-preserving encoders for non-Gaussian data, improving accuracy and computational efficiency.

DetailsMotivation: Existing PID methods are computationally expensive and inaccurate for continuous high-dimensional modalities due to joint distribution optimization constraints.

Method: Proposed GPID with gradient-based algorithm for Gaussian pairwise distributions, plus information-preserving encoders to transform non-Gaussian data to Gaussian form.

Result: More accurate and efficient PID estimates than baselines in synthetic examples, validated on large-scale multimodal benchmarks for real-world applications.

Conclusion: The method successfully addresses computational limitations of PID, enabling practical multimodal analysis with improved efficiency and accuracy.

Abstract: The study of multimodality has garnered significant interest in fields where the analysis of interactions among multiple information sources can enhance predictive modeling, data fusion, and interpretability. Partial information decomposition (PID) has emerged as a useful information-theoretic framework to quantify the degree to which individual modalities independently, redundantly, or synergistically convey information about a target variable. However, existing PID methods depend on optimizing over a joint distribution constrained by estimated pairwise probability distributions, which are costly and inaccurate for continuous and high-dimensional modalities. Our first key insight is that the problem can be solved efficiently when the pairwise distributions are multivariate Gaussians, and we refer to this problem as Gaussian PID (GPID). We propose a new gradient-based algorithm that substantially improves the computational efficiency of GPID based on an alternative formulation of the underlying optimization problem. To generalize the applicability to non-Gaussian data, we learn information-preserving encoders to transform random variables of arbitrary input distributions into pairwise Gaussian random variables. Along the way, we resolved an open problem regarding the optimality of joint Gaussian solutions for GPID. Empirical validation in diverse synthetic examples demonstrates that our proposed method provides more accurate and efficient PID estimates than existing baselines. We further evaluate a series of large-scale multimodal benchmarks to show its utility in real-world applications of quantifying PID in multimodal datasets and selecting high-performing models.

[797] Proximal Diffusion Neural Sampler

Wei Guo, Jaemoo Choi, Yuchen Zhu, Molei Tao, Yongxin Chen

Main category: cs.LG

TL;DR: PDNS is a proximal diffusion neural sampler that addresses mode collapse in multimodal distributions by decomposing learning into simpler subproblems using proximal point method on path measures.

DetailsMotivation: Training neural samplers for multimodal distributions with significant barriers between modes can lead to mode collapse, making it challenging to explore all modes effectively.

Method: Proximal Diffusion Neural Sampler (PDNS) uses proximal point method on path measures to decompose learning into simpler subproblems, with each step using proximal weighted denoising cross-entropy (WDCE) objective.

Result: PDNS demonstrates effectiveness and robustness in extensive experiments on continuous and discrete sampling tasks, including challenging scenarios in molecular dynamics and statistical physics.

Conclusion: PDNS provides a staged procedure that creates progressively refined paths to target distributions, promoting thorough exploration across modes and addressing mode collapse issues.

Abstract: The task of learning a diffusion-based neural sampler for drawing samples from an unnormalized target distribution can be viewed as a stochastic optimal control problem on path measures. However, the training of neural samplers can be challenging when the target distribution is multimodal with significant barriers separating the modes, potentially leading to mode collapse. We propose a framework named \textbf{Proximal Diffusion Neural Sampler (PDNS)} that addresses these challenges by tackling the stochastic optimal control problem via proximal point method on the space of path measures. PDNS decomposes the learning process into a series of simpler subproblems that create a path gradually approaching the desired distribution. This staged procedure traces a progressively refined path to the desired distribution and promotes thorough exploration across modes. For a practical and efficient realization, we instantiate each proximal step with a proximal weighted denoising cross-entropy (WDCE) objective. We demonstrate the effectiveness and robustness of PDNS through extensive experiments on both continuous and discrete sampling tasks, including challenging scenarios in molecular dynamics and statistical physics.

[798] TROLL: Trust Regions improve Reinforcement Learning for Large Language Models

Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann

Main category: cs.LG

TL;DR: TROLL replaces PPO’s clipping mechanism with a discrete differentiable trust region projection for more stable and effective RL fine-tuning of LLMs.

DetailsMotivation: PPO's clipping mechanism is a crude approximation of KL-based trust regions that causes unstable updates and suboptimal performance in RL fine-tuning of LLMs.

Method: Uses a novel discrete differentiable trust region projection with token-level KL constraints, operating on sparse subsets of important token logits to balance computational cost and effectiveness.

Result: Consistently outperforms PPO-like clipping across datasets, model families, and advantage-estimation methods in training speed, stability, and final success rates.

Conclusion: TROLL serves as a direct replacement for PPO clipping during training while maintaining the same inference behavior, providing more principled and effective trust region optimization.

Abstract: On-policy Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model’s most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model’s inference behavior. Across datasets, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.

[799] HOFLON: Hybrid Offline Learning and Online Optimization for Process Start-Up and Grade-Transition Control

Alex Durkin, Jasper Stolte, Mehmet Mercangöz

Main category: cs.LG

TL;DR: HOFLON is a hybrid offline reinforcement learning method that combines offline learning of data manifolds and Q-critics with online optimization to automate plant start-ups and grade-changes, outperforming both standard offline RL and human experts.

DetailsMotivation: Manual operation by expert operators for plant start-ups and grade-changes is becoming unsustainable due to workforce retirement, requiring automated solutions that can capture and surpass human expertise without process models.

Method: Hybrid approach: offline learning of latent data manifolds and long-horizon Q-critics from historical logs, combined with online optimization that maximizes Q-critic while penalizing deviations from learned manifold and excessive control changes.

Result: HOFLON outperformed Implicit Q-Learning and achieved better cumulative rewards than the best historical start-ups/grade-changes in both polymerization reactor and paper-machine case studies.

Conclusion: HOFLON demonstrates potential to automate transition operations beyond current expert capability by effectively overcoming distribution shift and value-overestimation issues in offline RL.

Abstract: Start-ups and product grade-changes are critical steps in continuous-process plant operation, because any misstep immediately affects product quality and drives operational losses. These transitions have long relied on manual operation by a handful of expert operators, but the progressive retirement of that workforce is leaving plant owners without the tacit know-how needed to execute them consistently. In the absence of a process model, offline reinforcement learning (RL) promises to capture and even surpass human expertise by mining historical start-up and grade-change logs, yet standard offline RL struggles with distribution shift and value-overestimation whenever a learned policy ventures outside the data envelope. We introduce HOFLON (Hybrid Offline Learning + Online Optimization) to overcome those limitations. Offline, HOFLON learns (i) a latent data manifold that represents the feasible region spanned by past transitions and (ii) a long-horizon Q-critic that predicts the cumulative reward from state-action pairs. Online, it solves a one-step optimization problem that maximizes the Q-critic while penalizing deviations from the learned manifold and excessive rates of change in the manipulated variables. We test HOFLON on two industrial case studies: a polymerization reactor start-up and a paper-machine grade-change problem, and benchmark it against Implicit Q-Learning (IQL), a leading offline-RL algorithm. In both plants HOFLON not only surpasses IQL but also delivers, on average, better cumulative rewards than the best start-up or grade-change observed in the historical data, demonstrating its potential to automate transition operations beyond current expert capability.

[800] LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin

Main category: cs.LG

TL;DR: LaDiR is a novel reasoning framework that combines continuous latent representations with latent diffusion models to enable iterative refinement of reasoning steps, overcoming limitations of autoregressive decoding in LLMs.

DetailsMotivation: LLMs' autoregressive decoding limits their ability to holistically revisit and refine earlier reasoning steps, leading to inefficient exploration of diverse solutions. The paper aims to address these limitations.

Method: Uses a VAE to encode text reasoning steps into blocks of thought tokens, then applies a latent diffusion model with blockwise bidirectional attention mask to denoise and iteratively refine reasoning trajectories in parallel.

Result: LaDiR consistently improves accuracy, diversity, and interpretability over existing methods on mathematical reasoning and planning benchmarks, enabling efficient parallel generation of diverse reasoning trajectories.

Conclusion: LaDiR reveals a new paradigm for text reasoning with latent diffusion, demonstrating superior performance in iterative refinement and holistic reasoning compared to autoregressive and other reasoning methods.

Abstract: Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLM’s autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories, allowing the model to plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.

[801] Technical note on Fisher Information for Robust Federated Cross-Validation

Behraj Khan, Tahir Qasim Syed

Main category: cs.LG

TL;DR: FIRE (Fisher Information for Robust fEderated validation) addresses performance degradation in fragmented data training by using Fisher information to estimate covariate shift divergences and apply per-fragment loss penalties for distribution alignment.

DetailsMotivation: When training data are fragmented across batches or federated-learned across different locations, models suffer performance degradation due to covariate shift from dissimilar empirical training distributions across fragments.

Method: Proposes FIRE method that accumulates fragmentation-induced covariate shift divergences from global training distribution via approximate Fisher information, used as per-fragment loss penalty for scalable distribution alignment.

Result: FIRE outperforms importance weighting benchmarks by up to 5.1% and federated learning benchmarks by up to 5.3% on shifted validation sets.

Conclusion: FIRE effectively addresses covariate shift in fragmented data scenarios through Fisher information-based distribution alignment, achieving significant performance improvements over existing methods.

Abstract: When training data are fragmented across batches or federated-learned across different geographic locations, trained models manifest performance degradation. That degradation partly owes to covariate shift induced by data having been fragmented across time and space and producing dissimilar empirical training distributions. Each fragment’s distribution is slightly different to a hypothetical unfragmented training distribution of covariates, and to the single validation distribution. To address this problem, we propose Fisher Information for Robust fEderated validation (\textbf{FIRE}). This method accumulates fragmentation-induced covariate shift divergences from the global training distribution via an approximate Fisher information. That term, which we prove to be a more computationally-tractable estimate, is then used as a per-fragment loss penalty, enabling scalable distribution alignment. FIRE outperforms importance weighting benchmarks by $5.1%$ at maximum and federated learning (FL) benchmarks by up to $5.3%$ on shifted validation sets.

[802] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun

Main category: cs.LG

TL;DR: ACE (Agentic Context Engineering) is a framework that treats contexts as evolving playbooks to prevent brevity bias and context collapse in LLM applications, achieving significant performance improvements on agent and domain-specific benchmarks.

DetailsMotivation: Address limitations of prior approaches that suffer from brevity bias (dropping domain insights for concise summaries) and context collapse (eroding details through iterative rewriting) in LLM context adaptation.

Method: A modular process of generation, reflection, and curation that treats contexts as evolving playbooks with structured, incremental updates to preserve detailed knowledge and scale with long-context models.

Result: +10.6% improvement on agents and +8.6% on finance benchmarks, significant reduction in adaptation latency and rollout cost, matches top-ranked production agent on AppWorld leaderboard overall and surpasses it on harder test-challenge split.

Conclusion: Comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead, effective adaptation without labeled supervision using natural execution feedback.

Abstract: Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation – modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.

[803] Technical note on Sequential Test-Time Adaptation via Martingale-Driven Fisher Prompting

Behraj Khan, Tahir Qasim Syed

Main category: cs.LG

TL;DR: M-FISHER is a method for sequential distribution shift detection and stable adaptation in streaming data with time-uniform false alarm guarantees and locally optimal adaptation updates.

DetailsMotivation: To address the challenge of detecting distribution shifts and adapting models in streaming data environments while maintaining statistical validity and stability.

Method: Uses exponential martingale from non-conformity scores with Ville’s inequality for detection, and Fisher-preconditioned prompt parameter updates for adaptation, implementing natural gradient descent.

Result: Provides time-uniform false alarm control, bounds detection delay as O(log(1/δ)/Γ), and achieves locally optimal updates that minimize KL divergence while preserving stability.

Conclusion: M-FISHER offers a principled approach for robust, anytime-valid detection and geometrically stable adaptation under sequential covariate shift.

Abstract: We present a theoretical framework for M-FISHER, a method for sequential distribution shift detection and stable adaptation in streaming data. For detection, we construct an exponential martingale from non-conformity scores and apply Ville’s inequality to obtain time-uniform guarantees on false alarm control, ensuring statistical validity at any stopping time. Under sustained shifts, we further bound the expected detection delay as $\mathcal{O}(\log(1/\delta)/\Gamma)$, where $\Gamma$ reflects the post-shift information gain, thereby linking detection efficiency to distributional divergence. For adaptation, we show that Fisher-preconditioned updates of prompt parameters implement natural gradient descent on the distributional manifold, yielding locally optimal updates that minimize KL divergence while preserving stability and parameterization invariance. Together, these results establish M-FISHER as a principled approach for robust, anytime-valid detection and geometrically stable adaptation in sequential decision-making under covariate shift.

[804] Optimal Scaling Needs Optimal Norm

Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim

Main category: cs.LG

TL;DR: The paper discovers that optimal hyperparameter scaling across model and dataset sizes is governed by a single invariant - the operator norm of the output layer, termed ’norm transfer’. This constant norm condition is necessary but not sufficient for optimal performance.

DetailsMotivation: To establish a unifying explanatory principle for optimal hyperparameter transfer under model and dataset scaling, as no such principle existed despite recent progress in the field.

Method: Used the Scion optimizer across models up to 1.3B parameters trained on up to 138B tokens, analyzed the optimal learning rate/batch size pairs, and measured scaling with dataset size. Also tuned per-layer-group learning rates.

Result: Found that joint optimal scaling is governed by a single invariant - the operator norm of the output layer. The optimal learning rate/batch size pair consistently has the same operator norm value across different scales. Also provided the first measurement of optimal hyperparameter scaling with dataset size for Scion.

Conclusion: The constant norm condition is necessary but not sufficient for optimal performance. Practical insights on norm-guided optimal scaling are provided, along with the release of Distributed Scion (Disco) implementation and logs from over 2000 runs to support LLM training dynamics research.

Abstract: Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.

[805] On Using Large Language Models to Enhance Clinically-Driven Missing Data Recovery Algorithms in Electronic Health Records

Sarah C. Lotspeich, Abbey Collins, Brian J. Wells, Ashish K. Khanna, Joseph Rigdon, Lucy D’Agostino McGowan

Main category: cs.LG

TL;DR: A roadmap-driven algorithm using ICD-10 codes and LLM-enhanced auxiliary diagnoses can effectively recover missing EHR data with accuracy comparable to expert chart reviews, offering a scalable solution for large patient populations.

DetailsMotivation: EHR data often contains missing values and errors, and manual chart reviews are expensive and time-intensive, limiting the number of patients that can be reviewed. There's a need for automated methods to recover missing data at scale.

Method: Developed a roadmap-driven algorithm using ICD-10 codes to mimic expert chart reviews. Iteratively refined the roadmap using large language models (LLM) combined with clinical expertise to expand the list of auxiliary diagnoses. Tested algorithm performance with different roadmaps on 100 patients and applied the final algorithm to 1000 patients.

Result: The algorithm recovered as much, if not more, missing data as expert chart reviewers, depending on the roadmap used. The LLM-enhanced roadmap with clinician-approved additions performed effectively in the larger study.

Conclusion: Clinically-driven algorithms enhanced by LLMs can recover missing EHR data with similar accuracy to chart reviews and can be feasibly applied to large patient samples. Future work should extend these methods to monitor other dimensions of data quality like plausibility.

Abstract: Objective: Electronic health records (EHR) data are prone to missingness and errors. Previously, we devised an “enriched” chart review protocol where a “roadmap” of auxiliary diagnoses (anchors) was used to recover missing values in EHR data (e.g., a diagnosis of impaired glycemic control might imply that a missing hemoglobin A1c value would be considered unhealthy). Still, chart reviews are expensive and time-intensive, which limits the number of patients whose data can be reviewed. Now, we investigate the accuracy and scalability of a roadmap-driven algorithm, based on ICD-10 codes (International Classification of Diseases, 10th revision), to mimic expert chart reviews and recover missing values. Materials and Methods: In addition to the clinicians’ original roadmap from our previous work, we consider new versions that were iteratively refined using large language models (LLM) in conjunction with clinical expertise to expand the list of auxiliary diagnoses. Using chart reviews for 100 patients from the EHR at an extensive learning health system, we examine algorithm performance with different roadmaps. Using the larger study of $1000$ patients, we applied the final algorithm, which used a roadmap with clinician-approved additions from the LLM. Results: The algorithm recovered as much, if not more, missing data as the expert chart reviewers, depending on the roadmap. Discussion: Clinically-driven algorithms (enhanced by LLM) can recover missing EHR data with similar accuracy to chart reviews and can feasibly be applied to large samples. Extending them to monitor other dimensions of data quality (e.g., plausability) is a promising future direction.

[806] On Provable Benefits of Muon in Federated Learning

Xinwen Zhang, Hongchang Gao

Main category: cs.LG

TL;DR: FedMuon adapts the Muon optimizer for federated learning, offering convergence guarantees for nonconvex problems with orthonormalized updates that are problem-agnostic and robust to heavy-tailed noise.

DetailsMotivation: Muon optimizer shows superior performance but its effectiveness in federated learning remains unexplored, creating a research gap.

Method: Proposed FedMuon algorithm with orthonormalized update direction and established convergence analysis for nonconvex problems.

Result: FedMuon achieves convergence with learning rate independent of problem-specific parameters and naturally accommodates heavy-tailed noise. Extensive experiments validate effectiveness across various neural network architectures.

Conclusion: FedMuon successfully adapts Muon to federated learning with theoretical guarantees and practical effectiveness, addressing the gap in federated optimization.

Abstract: The recently introduced optimizer, Muon, has gained increasing attention due to its superior performance across a wide range of applications. However, its effectiveness in federated learning remains unexplored. To address this gap, this paper investigates the performance of Muon in the federated learning setting. Specifically, we propose a new algorithm, FedMuon, and establish its convergence rate for nonconvex problems. Our theoretical analysis reveals multiple favorable properties of FedMuon. In particular, due to its orthonormalized update direction, the learning rate of FedMuon is independent of problem-specific parameters, and, importantly, it can naturally accommodate heavy-tailed noise. The extensive experiments on a variety of neural network architectures validate the effectiveness of the proposed algorithm.

[807] BONSAI: Structure-exploiting robust Bayesian optimization for networked black-box systems under uncertainty

Akshay Kudva, Joel A. Paulson

Main category: cs.LG

TL;DR: BONSAI is a new robust Bayesian optimization framework that leverages partial structural knowledge in simulation-based models, representing objectives as directed graphs of interconnected components to improve sample efficiency and solution quality in high-dimensional uncertainty-aware design.

DetailsMotivation: Traditional robust optimization methods require known problem structure and struggle with high-fidelity simulations, while existing robust Bayesian optimization methods ignore structural information and have scalability issues in high-dimensional settings.

Method: BONSAI represents objectives as directed graphs of interconnected white- and black-box components, uses a scalable Thompson sampling-based acquisition function tailored for structured robust optimization, and employs gradient-based optimization methods.

Result: BONSAI consistently delivers more sample-efficient and higher-quality robust solutions compared to existing simulation-based robust optimization algorithms across diverse synthetic and real-world case studies, including process systems engineering applications.

Conclusion: BONSAI offers practical advantages for uncertainty-aware design in complex engineering systems by effectively leveraging partial structural knowledge and enabling scalable robust optimization in high-dimensional settings.

Abstract: Optimal design under uncertainty remains a fundamental challenge in advancing reliable, next-generation process systems. Robust optimization (RO) offers a principled approach by safeguarding against worst-case scenarios across a range of uncertain parameters. However, traditional RO methods typically require known problem structure, which limits their applicability to high-fidelity simulation environments. To overcome these limitations, recent work has explored robust Bayesian optimization (RBO) as a flexible alternative that can accommodate expensive, black-box objectives. Existing RBO methods, however, generally ignore available structural information and struggle to scale to high-dimensional settings. In this work, we introduce BONSAI (Bayesian Optimization of Network Systems under uncertAInty), a new RBO framework that leverages partial structural knowledge commonly available in simulation-based models. Instead of treating the objective as a monolithic black box, BONSAI represents it as a directed graph of interconnected white- and black-box components, allowing the algorithm to utilize intermediate information within the optimization process. We further propose a scalable Thompson sampling-based acquisition function tailored to the structured RO setting, which can be efficiently optimized using gradient-based methods. We evaluate BONSAI across a diverse set of synthetic and real-world case studies, including applications in process systems engineering. Compared to existing simulation-based RO algorithms, BONSAI consistently delivers more sample-efficient and higher-quality robust solutions, highlighting its practical advantages for uncertainty-aware design in complex engineering systems.

[808] ONNX-Net: Towards Universal Representations and Instant Performance Prediction for Neural Architectures

Shiwen Qin, Alexander Auras, Shay B. Cohen, Elliot J. Crowley, Michael Moeller, Linus Ericsson, Jovita Lukasik

Main category: cs.LG

TL;DR: ONNX-Bench is a benchmark with 600k+ neural network architectures in ONNX format, enabling a universal text-based encoding (ONNX-Net) for performance prediction across diverse search spaces.

DetailsMotivation: To overcome limitations of existing NAS methods that are tied to specific cell-based search spaces and graph encodings, enabling more flexible and scalable neural architecture evaluation.

Method: Created ONNX-Bench benchmark with unified ONNX format, developed ONNX-Net text-based encoding using natural language descriptions, and trained performance predictors that can generalize across different search spaces.

Result: Achieved strong zero-shot performance across disparate search spaces using minimal pretraining samples, enabling instant evaluation of any neural network architecture.

Conclusion: ONNX-Bench provides a universal framework for neural architecture evaluation that transcends individual search space restrictions, offering unprecedented flexibility and scalability in NAS.

Abstract: Neural architecture search (NAS) automates the design process of high-performing architectures, but remains bottlenecked by expensive performance evaluation. Most existing studies that achieve faster evaluation are mostly tied to cell-based search spaces and graph encodings tailored to those individual search spaces, limiting their flexibility and scalability when applied to more expressive search spaces. In this work, we aim to close the gap of individual search space restrictions and search space dependent network representations. We present ONNX-Bench, a benchmark consisting of a collection of neural networks in a unified format based on ONNX files. ONNX-Bench includes all open-source NAS-bench-based neural networks, resulting in a total size of more than 600k {architecture, accuracy} pairs. This benchmark allows creating a shared neural network representation, ONNX-Net, able to represent any neural architecture using natural language descriptions acting as an input to a performance predictor. This text-based encoding can accommodate arbitrary layer types, operation parameters, and heterogeneous topologies, enabling a single surrogate to generalise across all neural architectures rather than being confined to cell-based search spaces. Experiments show strong zero-shot performance across disparate search spaces using only a small amount of pretraining samples, enabling the unprecedented ability to evaluate any neural network architecture instantly.

[809] On the Convergence and Size Transferability of Continuous-depth Graph Neural Networks

Mingsong Yan, Charles Kulick, Sui Tang

Main category: cs.LG

TL;DR: The paper provides a rigorous convergence analysis of Graph Neural Differential Equations (GNDEs) in the infinite-node limit, establishing their size transferability properties and deriving explicit convergence rates under different graph sampling regimes.

DetailsMotivation: To provide theoretical insights into the size transferability of continuous-depth graph neural networks (GNDEs) and justify the practical strategy of transferring models trained on moderate-sized graphs to larger graphs without retraining.

Method: Introduces Graphon Neural Differential Equations (Graphon-NDEs) as the infinite-node limit of GNDEs, uses graphon theory and dynamical systems tools to prove trajectory-wise convergence, and derives explicit convergence rates under deterministic graph sampling regimes.

Result: Establishes well-posedness of Graphon-NDEs, proves convergence of GNDE solutions to Graphon-NDE solutions, and derives explicit convergence rates for weighted graphs from smooth graphons and unweighted graphs from discontinuous graphons.

Conclusion: The theoretical analysis provides justification for size transferability of GNDE models, with numerical experiments on synthetic and real data supporting the theoretical findings.

Abstract: Continuous-depth graph neural networks, also known as Graph Neural Differential Equations (GNDEs), combine the structural inductive bias of Graph Neural Networks (GNNs) with the continuous-depth architecture of Neural ODEs, offering a scalable and principled framework for modeling dynamics on graphs. In this paper, we present a rigorous convergence analysis of GNDEs with time-varying parameters in the infinite-node limit, providing theoretical insights into their size transferability. To this end, we introduce Graphon Neural Differential Equations (Graphon-NDEs) as the infinite-node limit of GNDEs and establish their well-posedness. Leveraging tools from graphon theory and dynamical systems, we prove the trajectory-wise convergence of GNDE solutions to Graphon-NDE solutions. Moreover, we derive explicit convergence rates under two deterministic graph sampling regimes: (1) weighted graphs sampled from smooth graphons, and (2) unweighted graphs sampled from ${0,1}$-valued (discontinuous) graphons. We further establish size transferability bounds, providing theoretical justification for the practical strategy of transferring GNDE models trained on moderate-sized graphs to larger, structurally similar graphs without retraining. Numerical experiments using synthetic and real data support our theoretical findings.

[810] LLM as an Algorithmist: Enhancing Anomaly Detectors via Programmatic Synthesis

Hangting Ye, Jinmeng Li, He Zhao, Mingchen Zhuge, Dandan Guo, Yi Chang, Hongyuan Zha

Main category: cs.LG

TL;DR: LLM-DAS is a framework that uses LLMs to analyze detector weaknesses and generate Python code to synthesize hard-to-detect anomalies, enhancing detector robustness without exposing raw data.

DetailsMotivation: Existing anomaly detection methods have inconsistent real-world performance due to assumptions about anomaly patterns, and direct LLM application faces challenges with heterogeneous data processing and privacy risks.

Method: Repositions LLM as an ‘algorithmist’ to analyze detector descriptions, identify weaknesses, and generate detector-specific Python code for synthesizing hard-to-detect anomalies to augment training data.

Result: Extensive experiments on 36 TAD benchmarks show LLM-DAS consistently boosts performance of mainstream detectors, transforming the problem into more discriminative two-class classification.

Conclusion: LLM-DAS bridges LLM reasoning with classic AD algorithms via programmatic synthesis, offering scalable, effective, and privacy-preserving approach to patch logical blind spots of existing detectors.

Abstract: Existing anomaly detection (AD) methods for tabular data usually rely on some assumptions about anomaly patterns, leading to inconsistent performance in real-world scenarios. While Large Language Models (LLMs) show remarkable reasoning capabilities, their direct application to tabular AD is impeded by fundamental challenges, including difficulties in processing heterogeneous data and significant privacy risks. To address these limitations, we propose LLM-DAS, a novel framework that repositions the LLM from a data processor'' to an algorithmist’’. Instead of being exposed to raw data, our framework leverages the LLM’s ability to reason about algorithms. It analyzes a high-level description of a given detector to understand its intrinsic weaknesses and then generates detector-specific, data-agnostic Python code to synthesize ``hard-to-detect’’ anomalies that exploit these vulnerabilities. This generated synthesis program, which is reusable across diverse datasets, is then instantiated to augment training data, systematically enhancing the detector’s robustness by transforming the problem into a more discriminative two-class classification task. Extensive experiments on 36 TAD benchmarks show that LLM-DAS consistently boosts the performance of mainstream detectors. By bridging LLM reasoning with classic AD algorithms via programmatic synthesis, LLM-DAS offers a scalable, effective, and privacy-preserving approach to patching the logical blind spots of existing detectors.

[811] On Structured State-Space Duality

Jerry Yao-Chieh Hu, Xiwen Zhang, Weimin Wu, Han Liu

Main category: cs.LG

TL;DR: SSD establishes duality between diagonal SSMs and masked attention, showing equivalent sequence transformations can be implemented as O(T) recurrence or O(T²) attention, with limitations for softmax attention.

DetailsMotivation: To formalize and generalize the structured state-space duality (SSD) from scalar-identity SSMs to general diagonal SSMs, bridging recurrent models and attention mechanisms.

Method: Extend SSD to diagonal SSMs, prove training complexity bounds, establish necessary/sufficient conditions for SSM-attention equivalence, and analyze rank explosion in softmax attention.

Result: Diagonal SSMs match scalar case’s training complexity while supporting richer dynamics; identified conditions for SSM-attention equivalence; showed duality fails for softmax attention due to rank issues.

Conclusion: The work tightens connections between recurrent SSMs and Transformers, expanding design space for efficient sequence models with both recurrence and attention interpretations.

Abstract: Structured State-Space Duality (SSD) [Dao & Gu, ICML 2024] is an equivalence between a simple Structured State-Space Model (SSM) and a masked attention mechanism. In particular, a state-space model with a scalar-times-identity state matrix is equivalent to a masked self-attention with a $1$-semiseparable causal mask. Consequently, the same sequence transformation (model) has two algorithmic realizations: as a linear-time $O(T)$ recurrence or as a quadratic-time $O(T^2)$ attention. In this note, we formalize and generalize this duality: (i) we extend SSD from the scalar-identity case to general diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs match the scalar case’s training complexity lower bounds while supporting richer dynamics; (iii) we establish a necessary and sufficient condition under which an SSM is equivalent to $1$-semiseparable masked attention; and (iv) we show that such duality fails to extend to standard softmax attention due to rank explosion. Together, these results tighten bridge between recurrent SSMs and Transformers, and widen the design space for expressive yet efficient sequence models.

[812] SPEAR: Soft Prompt Enhanced Anomaly Recognition for Time Series Data

Hanzhe Wei, Jiajun Wu, Jialin Yang, Henry Leung, Steve Drew

Main category: cs.LG

TL;DR: SPEAR uses soft prompts and quantization to adapt LLMs for time series anomaly detection, overcoming limitations of traditional methods with variable-length sequences and context-based anomalies.

DetailsMotivation: Traditional time series anomaly detection methods struggle with variable-length sequences and context-based anomalies. LLMs offer new opportunities but need adaptation for time series data.

Method: Quantize time series data into embeddings, combine with learnable soft prompt embeddings, feed into frozen LLM, and update soft prompts iteratively using cross-entropy loss.

Result: Soft prompts effectively increase LLMs’ performance in time series anomaly detection tasks, demonstrating successful adaptation of LLMs to time series data.

Conclusion: SPEAR successfully leverages LLMs for time series anomaly detection through soft prompts and quantization, providing an effective solution for handling variable-length sequences and context-based anomalies.

Abstract: Time series anomaly detection plays a crucial role in a wide range of fields, such as healthcare and internet traffic monitoring. The emergence of large language models (LLMs) offers new opportunities for detecting anomalies in the ubiquitous time series data. Traditional approaches struggle with variable-length time series sequences and context-based anomalies. We propose Soft Prompt Enhanced Anomaly Recognition (SPEAR), a novel approach to leverage LLMs for anomaly detection with soft prompts and quantization. Our methodology involves quantizing and transforming the time series data into input embeddings and combining them with learnable soft prompt embeddings. These combined embeddings are then fed into a frozen LLM. The soft prompts are updated iteratively based on a cross-entropy loss, allowing the model to adapt to time series anomaly detection. The use of soft prompts helps adapt LLMs effectively to time series tasks, while quantization ensures optimal handling of sequences, as LLMs are designed to handle discrete sequences. Our experimental results demonstrate that soft prompts effectively increase LLMs’ performance in downstream tasks regarding time series anomaly detection.

[813] THEMIS: Unlocking Pretrained Knowledge with Foundation Model Embeddings for Anomaly Detection in Time Series

Yadav Mahesh Lorik, Kaushik Sarveswaran, Nagaraj Sundaramahalingam, Aravindakumar Venugopalan

Main category: cs.LG

TL;DR: THEMIS is a new framework for time series anomaly detection that uses pretrained foundation model embeddings and outlier detection techniques to achieve state-of-the-art performance with interpretability.

DetailsMotivation: Time series anomaly detection faces challenges including seasonality, trends, concept drift, data imbalance, high dimensionality, real-time requirements, and interpretability needs, requiring flexible and robust approaches.

Method: THEMIS extracts embeddings from Chronos time series foundation model encoder and applies outlier detection techniques like Local Outlier Factor and Spectral Decomposition on self-similarity matrices.

Result: Achieves SOTA results on MSL dataset and competitive performance on SMAP and SWAT datasets, outperforming models specifically trained for anomaly detection with hyperparameter robustness.

Conclusion: Pretrained representations from foundation models enable efficient and adaptable time series anomaly detection with built-in interpretability.

Abstract: Time series anomaly detection forms a very crucial area in several domains but poses substantial challenges. Due to time series data possessing seasonality, trends, noise, and evolving patterns (concept drift), it becomes very difficult to set a general notion of what constitutes normal behavior. Anomalies themselves could be varied, ranging from a single outlier to contextual or collective anomalies, and are normally very rare; hence, the dataset is largely imbalanced. Additional layers of complexities arise due to the problems of increased dimensionality of modern time series, real-time detection criteria, setting up appropriate detection thresholds, and arriving at results that are interpretable. To embrace these multifaceted challenges, very strong, flexible, and interpretable approaches are required. This paper presents THEMIS, a new framework for time series anomaly detection that exploits pretrained knowledge from foundation models. THEMIS extracts embeddings from the encoder of the Chronos time series foundation model and applies outlier detection techniques like Local Outlier Factor and Spectral Decomposition on the self-similarity matrix, to spot anomalies in the data. Our experiments show that this modular method achieves SOTA results on the MSL dataset and performs quite competitively on the SMAP and SWAT$^*$ datasets. Notably, THEMIS exceeds models trained specifically for anomaly detection, presenting hyperparameter robustness and interpretability by default. This paper advocates for pretrained representations from foundation models for performing efficient and adaptable anomaly detection for time series data.

[814] Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training

Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang

Main category: cs.LG

TL;DR: Reinforce-Ada is an adaptive sampling framework for online RL post-training of LLMs that dynamically reallocates sampling effort to prompts with greatest uncertainty, using online successive elimination and fixed-size groups with reward diversity to stabilize updates.

DetailsMotivation: Traditional RL for LLMs suffers from unstable gradient estimates due to fixed uniform sampling across prompts, which limits learning efficiency and reliability for reasoning tasks.

Method: Proposes an online adaptive sampling framework that interleaves estimation and sampling, uses successive elimination to stop sampling when sufficient signal is collected, forms fixed-size groups with enforced reward diversity, and computes advantage baselines using global statistics.

Result: Empirical results across multiple model architectures and reasoning benchmarks show accelerated convergence and improved final performance compared to GRPO, especially with balanced sampling variant.

Conclusion: Demonstrates the importance of variance-aware, adaptive data curation for efficient and reliable reinforcement learning in reasoning-capable LLMs.

Abstract: Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.

[815] Generalized Fitted Q-Iteration with Clustered Data

Liyuan Hu, Jitao Wang, Zhenke Wu, Chengchun Shi

Main category: cs.LG

TL;DR: Proposes generalized fitted Q-iteration algorithm for reinforcement learning with clustered data, incorporating generalized estimating equations to handle intra-cluster correlations.

DetailsMotivation: Address reinforcement learning with clustered data commonly found in healthcare applications where standard methods fail to account for intra-cluster correlations.

Method: Develop generalized fitted Q-iteration algorithm that incorporates generalized estimating equations into policy learning to properly handle clustered data structure.

Result: Theoretical optimality when correlation structure is correctly specified, and consistency when mis-specified. Empirical results show 50% reduction in regret compared to standard FQI in simulations and mobile health dataset analysis.

Conclusion: Generalized FQI effectively handles clustered data in RL, significantly outperforming standard methods in healthcare applications.

Abstract: This paper focuses on reinforcement learning (RL) with clustered data, which is commonly encountered in healthcare applications. We propose a generalized fitted Q-iteration (FQI) algorithm that incorporates generalized estimating equations into policy learning to handle the intra-cluster correlations. Theoretically, we demonstrate (i) the optimalities of our Q-function and policy estimators when the correlation structure is correctly specified, and (ii) their consistencies when the structure is mis-specified. Empirically, through simulations and analyses of a mobile health dataset, we find the proposed generalized FQI achieves, on average, a half reduction in regret compared to the standard FQI.

[816] What Can You Do When You Have Zero Rewards During RL?

Jatin Prakash, Anirudh Buvanesh

Main category: cs.LG

TL;DR: RL with outcome-based rewards fails when base models never sample correct solutions (zero-reward barrier). Adding easier training samples enables solving hard tasks without algorithm changes.

DetailsMotivation: To overcome the zero-reward barrier in RL for LLMs where learning stalls when no correct solutions are sampled during training.

Method: Evaluated existing RL methods and introduced a simple data-centric intervention of adding easier samples to training sets, without modifying RL algorithms.

Result: Existing methods failed to overcome zero-reward barrier, but adding easier samples enabled models to eventually solve hard tasks despite starting from zero reward.

Conclusion: Data-centric interventions (easier samples) can overcome zero-reward barriers in RL for reasoning tasks, more effective than algorithm modifications alone.

Abstract: Reinforcement learning (RL) with outcome-based rewards has proven effective for improving large language models (LLMs) on complex reasoning tasks. However, its success often depends on the base model occasionally sampling correct solutions. When no correct solutions are sampled, training encounters a zero-reward barrier where learning stalls due to zero gradients. We study this scenario through the graph search task introduced in Bachmann et al. (2024) and evaluate recent methods that incorporate desirable components such as dense rewards, diversity incentives, and improved credit assignment. Our experiments show that none of these approaches overcome the zero-reward barrier if the base model never produces a correct answer. In contrast, we find that a simple data-centric intervention of adding easier samples to the training set enables the model to eventually solve the original hard task despite starting from zero reward. Importantly, this succeeds without modifying the RL algorithm itself. Because official implementations of several baselines were unavailable, we developed our own, which allowed us to conduct a detailed analysis of their failure modes. We release these implementations to support further research at: https://github.com/rl4reasoning/rl-baselines

[817] Transductive and Learning-Augmented Online Regression

Vinod Raman, Shenghao Xie, Samson Zhou

Main category: cs.LG

TL;DR: This paper studies online regression with access to predictions about future examples, establishing minimax regret bounds using fat-shattering dimension and developing algorithms that adapt to prediction quality.

DetailsMotivation: Real-life data streams often exhibit predictability, motivating the study of online regression where learners can leverage predictions about future examples to improve performance.

Method: The authors first analyze transductive online learning (where all examples are known in advance), then generalize to imperfect predictions. They use fat-shattering dimension to characterize minimax regret and develop adaptive online learners.

Result: They establish a separation between transductive and adversarial online regression, and develop an algorithm whose regret improves smoothly with prediction quality, matching worst-case bounds when predictions are poor and approaching transductive performance when predictions are accurate.

Conclusion: The work enables learnability for previously unlearnable classes under predictable examples, aligning with the learning-augmented model paradigm by leveraging predictions to improve online learning performance.

Abstract: Motivated by the predictable nature of real-life in data streams, we study online regression when the learner has access to predictions about future examples. In the extreme case, called transductive online learning, the sequence of examples is revealed to the learner before the game begins. For this setting, we fully characterize the minimax expected regret in terms of the fat-shattering dimension, establishing a separation between transductive online regression and (adversarial) online regression. Then, we generalize this setting by allowing for noisy or \emph{imperfect} predictions about future examples. Using our results for the transductive online setting, we develop an online learner whose minimax expected regret matches the worst-case regret, improves smoothly with prediction quality, and significantly outperforms the worst-case regret when future example predictions are precise, achieving performance similar to the transductive online learner. This enables learnability for previously unlearnable classes under predictable examples, aligning with the broader learning-augmented model paradigm.

[818] Distilling Reasoning into Student LLMs: Local Naturalness for Selecting Teacher Data

Hoang Anh Just, Myeongseob Ko, Ruoxi Jia

Main category: cs.LG

TL;DR: This paper introduces Local Naturalness for better response selection in multi-teacher reasoning distillation, overcoming limitations of global log-probability methods when dealing with long reasoning traces from multiple teachers.

DetailsMotivation: Current methods for response selection in reasoning distillation fail when multiple teacher outputs are available, as global naturalness no longer correlates with downstream performance, especially with long reasoning traces from strong teachers.

Method: Proposes Local Naturalness, which measures student’s log-probabilities over short sequential reasoning steps conditioned only on a small local window, enabling teacher selection and response selection from multiple teachers.

Result: Local Naturalness boosts a 32B student’s accuracy on math benchmarks by 9.4pp over global selection and surpasses performance achieved by training on data from the single best teacher.

Conclusion: Localized data quality evaluation and data mixing enable more effective reasoning distillation, highlighting the power of local naturalness over global approaches in multi-teacher settings.

Abstract: Distilling long reasoning traces (10K+ tokens) from stronger teacher models into smaller student LLMs via SFT has emerged as a standard paradigm. This approach is practical and efficient: it leverages the ease of generating abundant reasoning data from stronger models and provides a direct, data-driven way to teach less capable models better reasoning. While previous work has largely focused on prompt selection with responses from a single teacher, the equally important problem of choosing the best response when multiple teacher outputs are available for a single prompt remains underexplored. This challenge becomes important in a multi-teacher setting, where different students may benefit from the outputs of different teachers. This paper fills that gap with a systematic study of response selection for reasoning distillation. We first show that the current method, which picks responses the student assigns the highest global log-probability (global naturalness), fails when responses come from multiple teachers, i.e., global naturalness no longer correlates with downstream performance, especially as the reasoning traces from strong teachers become longer. To overcome this problem, we introduce Local Naturalness, which measures the student’s log-probabilities over short, sequential reasoning steps conditioned only on a small local window. Local Naturalness enables two applications: 1) Teacher Selection: Aggregating local scores across prompts reliably identifies the most helpful teacher. 2) Response Selection from a Multiple Teachers: When mixing answers from many teachers, Local Naturalness boosts a 32B student’s accuracy on math benchmarks by 9.4pp over global selection, also surpassing the performance achieved by training on data from the single best teacher. These results highlight the power of localized data quality evaluation and data mixing for more effective reasoning distillation.

[819] On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

Weiqing He, Xiang Li, Tianqi Shang, Li Shen, Weijie Su, Qi Long

Main category: cs.LG

TL;DR: Systematic evaluation shows that general goodness-of-fit tests can significantly improve the detection power and robustness of LLM watermark detectors, especially benefiting from text repetition patterns in low-temperature settings.

DetailsMotivation: LLMs raise authenticity concerns as they can generate human-like text at scale. Text watermarks provide provable origin verification, but current detection methods using goodness-of-fit tests remain underexplored despite their natural suitability for watermark detection.

Method: Systematically evaluated eight goodness-of-fit tests across three popular watermarking schemes using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods.

Result: General goodness-of-fit tests improve both detection power and robustness of watermark detectors. Text repetition in low-temperature settings gives GoF tests a unique advantage not exploited by existing methods.

Conclusion: Classic goodness-of-fit tests are a simple yet powerful and underused tool for watermark detection in LLMs, offering improved performance over existing approaches.

Abstract: Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.

[820] Learning to Interpret Weight Differences in Language Models

Avichal Goel, Yoon Kim, Nir Shavit, Tony T. Wang

Main category: cs.LG

TL;DR: Diff Interpretation Tuning (DIT) trains models to describe their own finetuning-induced weight changes using natural language, enabling interpretability of model modifications.

DetailsMotivation: Finetuning changes model weights but these changes are not interpretable, and finetuning datasets are often unavailable or too large to analyze directly.

Method: Uses synthetic, labeled weight diffs to train a DIT adapter that can be applied to finetuned models to generate natural language descriptions of their modifications.

Result: In proof-of-concept settings, models accurately describe their finetuning-induced modifications using natural language for reporting hidden behaviors and summarizing finetuned knowledge.

Conclusion: DIT enables comprehensive understanding of weight diffs through natural language descriptions, making model finetuning changes interpretable.

Abstract: Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes (“weight diffs”) are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.

[821] A Mathematical Explanation of Transformers for Large Language Models and GPTs

Xue-Cheng Tai, Hao Liu, Lingfeng Li, Raymond H. Chan

Main category: cs.LG

TL;DR: This paper proposes a continuous mathematical framework that interprets Transformers as discretizations of structured integro-differential equations, with self-attention as a non-local integral operator and layer normalization as time-dependent projection.

DetailsMotivation: To develop a comprehensive mathematical theory explaining Transformer architecture structure and operations, which currently lacks rigorous theoretical foundations despite its revolutionary impact on sequence modeling.

Method: Proposes a continuous framework interpreting Transformers as discretizations of structured integro-differential equations, with self-attention as non-local integral operators and layer normalization as projections to time-dependent constraints.

Result: Develops a unified operator-theoretic and variational perspective that provides interpretable foundations for understanding attention, feedforward layers, and normalization, embedding the entire Transformer operation in continuous domains.

Conclusion: This framework bridges deep learning architectures with continuous mathematical modeling, offering new directions for architecture design, analysis, and control-based interpretations, contributing to theoretically grounded neural network models.

Abstract: The Transformer architecture has revolutionized the field of sequence modeling and underpins the recent breakthroughs in large language models (LLMs). However, a comprehensive mathematical theory that explains its structure and operations remains elusive. In this work, we propose a novel continuous framework that rigorously interprets the Transformer as a discretization of a structured integro-differential equation. Within this formulation, the self-attention mechanism emerges naturally as a non-local integral operator, and layer normalization is characterized as a projection to a time-dependent constraint. This operator-theoretic and variational perspective offers a unified and interpretable foundation for understanding the architecture’s core components, including attention, feedforward layers, and normalization. Our approach extends beyond previous theoretical analyses by embedding the entire Transformer operation in continuous domains for both token indices and feature dimensions. This leads to a principled and flexible framework that not only deepens theoretical insight but also offers new directions for architecture design, analysis, and control-based interpretations. This new interpretation provides a step toward bridging the gap between deep learning architectures and continuous mathematical modeling, and contributes a foundational perspective to the ongoing development of interpretable and theoretically grounded neural network models.

[822] What Is The Performance Ceiling of My Classifier? Utilizing Category-Wise Influence Functions for Pareto Frontier Analysis

Shahriar Kabir Nahin, Wenxiao Xiao, Joshua Liu, Anshuman Chhabra, Hongfu Liu

Main category: cs.LG

TL;DR: The paper proposes category-wise influence functions and a linear programming-based framework to achieve Pareto improvements in model performance across all categories, rather than just overall accuracy.

DetailsMotivation: Most existing data-centric learning focuses on identifying beneficial data, but this paper investigates the fundamental question of what is the performance ceiling of learning models, emphasizing Pareto improvements where every class benefits without tradeoffs.

Method: Proposed category-wise influence functions and influence vectors to quantify training sample impact across all categories. Developed a principled criterion and linear programming-based sample reweighting framework for Pareto improvements.

Result: Extensive experiments on synthetic datasets, vision, and text benchmarks demonstrate effectiveness in estimating and achieving model performance improvement across multiple categories.

Conclusion: The approach successfully enables Pareto performance improvements across all categories, addressing limitations of traditional influence functions that focus only on overall accuracy.

Abstract: Data-centric learning seeks to improve model performance from the perspective of data quality, and has been drawing increasing attention in the machine learning community. Among its key tools, influence functions provide a powerful framework to quantify the impact of individual training samples on model predictions, enabling practitioners to identify detrimental samples and retrain models on a cleaner dataset for improved performance. However, most existing work focuses on the question: “what data benefits the learning model?” In this paper, we take a step further and investigate a more fundamental question: “what is the performance ceiling of the learning model?” Unlike prior studies that primarily measure improvement through overall accuracy, we emphasize category-wise accuracy and aim for Pareto improvements, ensuring that every class benefits, rather than allowing tradeoffs where some classes improve at the expense of others. To address this challenge, we propose category-wise influence functions and introduce an influence vector that quantifies the impact of each training sample across all categories. Leveraging these influence vectors, we develop a principled criterion to determine whether a model can still be improved, and further design a linear programming-based sample reweighting framework to achieve Pareto performance improvements. Through extensive experiments on synthetic datasets, vision, and text benchmarks, we demonstrate the effectiveness of our approach in estimating and achieving a model’s performance improvement across multiple categories of interest.

[823] From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models

Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia

Main category: cs.LG

TL;DR: BVPO is a preference optimization method for large reasoning models that reduces gradient variance by mixing trace-based and empty-trace estimators, improving alignment and reasoning performance.

DetailsMotivation: Aligning large reasoning models with human preferences is crucial for deployment, but current methods suffer from high gradient variance due to stochastic trace sampling during preference optimization.

Method: BVPO frames preference optimization through bias-variance trade-off, mixing high-variance trace-based estimator with low-variance empty-trace estimator, with closed-form optimal mixing weight.

Result: BVPO improves alignment by up to 7.8 points on AlpacaEval~2 and 6.8 points on Arena-Hard, and boosts reasoning performance by up to 4.0 points on math benchmarks despite training only on conversational data.

Conclusion: Trace sampling variance is a key bottleneck in LRM alignment, and directly optimizing bias-variance trade-off yields more stable training and stronger performance.

Abstract: Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers, yielding strong gains on multi-step and mathematical tasks. Yet aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored. The statistically correct objective for preference alignment requires marginalizing over reasoning traces, but this computation is intractable in practice. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from stochastic trace sampling. To address this challenge, we frame preference optimization for LRMs through the lens of the bias–variance trade-off and propose Bias–Variance Optimized Preference Optimization (BVPO), a simple, drop-in method that mixes two gradient estimators: a high-variance trace-based estimator and a low-variance empty-trace estimator obtained by disabling reasoning trace generation. Our theory shows that BVPO strictly reduces trace-induced variance for any nontrivial mixture, provides a closed-form choice of the mixing weight that minimizes mean-squared error relative to the true marginal gradient, and under standard smoothness and step-size conditions, tightens classical convergence bounds for stochastic gradient descent. Empirically, BVPO improves alignment over the best baseline by up to 7.8 points on AlpacaEval~2 and 6.8 points on Arena-Hard. Despite being trained only on general conversational data, BVPO also boosts reasoning performance for base models by up to 4.0 points on the average of six math reasoning benchmarks. These results identify variance from trace sampling as a key bottleneck and demonstrate that directly optimizing the bias–variance trade-off yields more stable training and stronger overall performance.

[824] Optimizing Resources for On-the-Fly Label Estimation with Multiple Unknown Medical Experts

Tim Bary, Tiffanie Godelaine, Axel Abels, Benoît Macq

Main category: cs.LG

TL;DR: Proposes an adaptive real-time annotation method for medical screening that dynamically queries experts based on instance difficulty, reducing expert queries by up to 50% while maintaining accuracy.

DetailsMotivation: Existing algorithms don't meet requirements for seamless integration into screening pipelines that handle continuous data with initially unknown expert proficiency.

Method: Adaptive approach that incrementally gathers expert opinions until confidence threshold is met, supporting on-the-fly labeling without prior knowledge of experts or pre-labeled data.

Result: Reduces expert queries by up to 50% while achieving accuracy comparable to non-adaptive baseline on three multi-annotator classification datasets.

Conclusion: The method provides accurate labels with reduced annotation overhead, making it suitable for integration into medical screening workflows.

Abstract: Accurate ground truth estimation in medical screening programs often relies on coalitions of experts and peer second opinions. Algorithms that efficiently aggregate noisy annotations can enhance screening workflows, particularly when data arrive continuously and expert proficiency is initially unknown. However, existing algorithms do not meet the requirements for seamless integration into screening pipelines. We therefore propose an adaptive approach for real-time annotation that (I) supports on-the-fly labeling of incoming data, (II) operates without prior knowledge of medical experts or pre-labeled data, and (III) dynamically queries additional experts based on the latent difficulty of each instance. The method incrementally gathers expert opinions until a confidence threshold is met, providing accurate labels with reduced annotation overhead. We evaluate our approach on three multi-annotator classification datasets across different modalities. Results show that our adaptive querying strategy reduces the number of expert queries by up to 50% while achieving accuracy comparable to a non-adaptive baseline. Our code is available at https://github.com/tbary/MEDICS

[825] Early-Warning of Thunderstorm-Driven Power Outages with a Two-Stage Machine Learning Model

Iryna Stanishevska

Main category: cs.LG

TL;DR: A two-stage early-warning model for thunderstorm-related power outages using only open data sources, combining logistic gate and LSTM regressor to predict outages 24-48 hours in advance.

DetailsMotivation: Thunderstorm-driven outages are difficult to predict due to chaotic convective processes, sparse noisy data, and most storms not causing damage, requiring better early warning systems.

Method: Two-stage model with logistic gate and LSTM regressor using open data (EAGLE-I for outages, METAR for weather), kriging for spatial interpolation, and causal spatio-temporal features capturing severe convection precursors.

Result: Two-stage model detects more reference peaks (3/4 vs 2/4 at ±48h, F1 66.7% vs 57.1%) with modest amplitude gains near peaks (2-3% lower cMASE at ±0-12h) and comparable overall errors to baseline.

Conclusion: Despite open-data noise, the feature-driven pipeline provides actionable early warnings for thunderstorm outages, with SHAP analysis confirming the value of moisture-advection and wind/gust precursors.

Abstract: Thunderstorm-driven outages are difficult to predict because most storms do not cause damage, convective processes occur rapidly and chaotically, and the available public data are both noisy and incomplete. We develop a 24-48 h early-warning model for summer, thunderstorm-related outages in Michigan using only open sources (EAGLE-I for ground truth; METAR for weather). We use the publicly released EAGLE-I outage dataset (2014-2022), maintained by Oak Ridge National Laboratory for the U.S. Department of Energy. The pipeline preserves convective micro-signals from a sparse station network via parameter-specific kriging with hourly variograms and targeted overdrafting to retain extremes, and builds causal spatio-temporal features (lags/rolling statistics; k-NN/IDW spatial aggregates) capturing precursors of severe convection (moisture advection, wind shifts, and pressure drops). The two-stage model design, combining a logistic gate and an LSTM regressor, limits routine periods and reduces noise exposure. The study uses event-centric metrics (cluster-based hits/misses/false alarms) and peak-conditional MASE (cMASE) in +/-Delta-hour windows around state-level peaks (>= 50,000), with uncertainty quantified by hourly moving-block bootstrap. On the test sample, Two-Stage detects more reference peaks across all windows (e.g., at +/-48 h it records 3/4 vs. 2/4; F1 66.7% vs. 57.1%) with one extra false alarm. Near peaks, it shows modest amplitude gains (2-3% lower cMASE at +/-0-12 h; bootstrap medians +9-13% at +/-6-12 h) but small losses at +/-36-48 h (~3-4%). Overall, errors are comparable to the one-step LSTM baseline. SHAP analysis confirms moisture-advection and wind/gust precursors, underscoring the value of the feature engineering. Despite open-data noise, the feature-driven pipeline yields actionable, event-focused early warnings for thunderstorm outages.

[826] Beyond Softmax: A New Perspective on Gradient Bandits

Emerson Melo, David Müller

Main category: cs.LG

TL;DR: This paper establishes connections between discrete choice models and online learning/bandit theory, introducing new algorithms with sublinear regret bounds and generalized gradient methods that handle correlated actions.

DetailsMotivation: To bridge discrete choice models with online learning and multi-armed bandits, addressing limitations of existing methods like restrictive independence assumptions in softmax formulations.

Method: Developed a broad algorithmic family including Exp3 variants, adversarial bandit algorithms from generalized nested logit models, and novel generalized gradient bandit algorithms that relax independence assumptions.

Result: Achieved sublinear regret bounds, introduced flexible model specifications with computational efficiency via closed-form sampling probabilities, and demonstrated practical effectiveness in stochastic bandit settings through numerical experiments.

Conclusion: The proposed algorithms successfully combine flexible model specification with computational efficiency, extending gradient bandit methods beyond softmax formulations to handle correlated learning dynamics across actions.

Abstract: We establish a link between a class of discrete choice models and the theory of online learning and multi-armed bandits. Our contributions are: (i) sublinear regret bounds for a broad algorithmic family, encompassing Exp3 as a special case; (ii) a new class of adversarial bandit algorithms derived from generalized nested logit models \citep{wen:2001}; and (iii) \textcolor{black}{we introduce a novel class of generalized gradient bandit algorithms that extends beyond the widely used softmax formulation. By relaxing the restrictive independence assumptions inherent in softmax, our framework accommodates correlated learning dynamics across actions, thereby broadening the applicability of gradient bandit methods.} Overall, the proposed algorithms combine flexible model specification with computational efficiency via closed-form sampling probabilities. Numerical experiments in stochastic bandit settings demonstrate their practical effectiveness.

[827] Replacing Softmax Similarity with a Sharpened Angular Similarity: Theory and Practice of Scaling To Billion-Context Attention

Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava

Main category: cs.LG

TL;DR: RACE Attention is a linear-time alternative to Softmax Attention that uses sharpened angular similarity and randomized projections with soft LSH to handle extremely long contexts (up to 75M tokens) efficiently.

DetailsMotivation: Softmax Attention's quadratic complexity makes it impractical for very long contexts, with current implementations like FlashAttention failing beyond ~4M tokens on modern hardware.

Method: Replaces exponential kernel with sharpened angular (cosine) similarity, and approximates attention via randomized projections and soft Locality-Sensitive Hashing (LSH).

Result: Matches accuracy of strong baselines across language modeling, masked language modeling, and text classification while reducing runtime and memory. Processes up to 12M tokens on GPU and 75M tokens on CPU.

Conclusion: RACE Attention provides a practical, theoretically grounded solution for extremely long context windows on current hardware.

Abstract: Softmax Attention has a quadratic time complexity, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention (an exact, GPU-optimized implementation of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular (cosine) similarity, and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text classification, RACE Attention matches the accuracy of strong baselines while reducing runtime and memory. In a controlled scale test, it processes up to 12 million tokens during a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU, well beyond the practical limits of the current state-of-the-art attention implementations. RACE Attention thus offers a practical, theoretically grounded mechanism for outrageously long context windows on today’s hardware. We hope that it gets adopted in practice.

[828] ICEPool: Enhancing Graph Pooling Networks with Inter-cluster Connectivity

Michael Yang

Main category: cs.LG

TL;DR: ICEPool is a hierarchical pooling framework that enhances inter-cluster connectivity understanding in graph neural networks, improving structural integrity preservation and boosting existing models’ performance.

DetailsMotivation: Existing hierarchical pooling models focus on cluster assignments but overlook relationships between clusters, limiting their ability to preserve structural integrity in graph data.

Method: ICEPool enhances inter-cluster connectivity by combining original model strengths with connectivity integration capabilities, using theoretical analysis for graph reconstruction validation.

Result: Experimental results demonstrate ICEPool’s compatibility with various models and its potential to boost performance of existing graph neural network architectures.

Conclusion: ICEPool effectively addresses the overlooked inter-cluster connectivity issue in hierarchical pooling models, providing a comprehensive framework that enhances graph-level representation and structural preservation.

Abstract: Hierarchical Pooling Models have demonstrated strong performance in classifying graph-structured data. While numerous innovative methods have been proposed to design cluster assignments and coarsening strategies, the relationships between clusters are often overlooked. In this paper, we introduce Inter-cluster Connectivity Enhancement Pooling (ICEPool), a novel hierarchical pooling framework designed to enhance model’s understanding of inter-cluster connectivity and ability of preserving the structural integrity in the original graph. ICEPool is compatible with a wide range of pooling-based GNN models. The deployment of ICEPool as an enhancement to existing models effectively combines the strengths of the original model with ICEPool’s capability to emphasize the integration of inter-cluster connectivity, resulting in a more comprehensive and robust graph-level representation. Moreover, we make theoretical analysis to ICEPool’s ability of graph reconstruction to demonstrate its effectiveness in learning inter-cluster relationship that is overlooked by conventional models. Finally, the experimental results show the compatibility of ICEPool with wide varieties of models and its potential to boost the performance of existing graph neural network architectures.

[829] Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models

Hao Wu, Yuan Gao, Xingjian Shi, Shuaipeng Li, Fan Xu, Fan Zhang, Zhihong Zhu, Weiyan Wang, Xiao Luo, Kun Wang, Xian Wu, Xiaomeng Huang

Main category: cs.LG

TL;DR: SFP introduces a Model-Based Reinforcement Learning approach for spatiotemporal forecasting that uses a Generative World Model to simulate future states, employs beam search planning with non-differentiable metrics as rewards, and iteratively self-trains the forecasting model using high-reward sequences as pseudo-labels.

DetailsMotivation: To overcome the challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, which traditional methods struggle with.

Method: Proposes Spatiotemporal Forecasting as Planning (SFP) with a Generative World Model for environmental simulation, beam search-based planning algorithm using non-differentiable metrics as rewards, and iterative self-training with high-reward sequences as pseudo-labels.

Result: Significantly reduces prediction error and demonstrates exceptional performance on critical domain metrics, particularly in capturing extreme events.

Conclusion: SFP provides an effective paradigm for spatiotemporal forecasting that successfully addresses stochasticity and non-differentiable metric challenges through reinforcement learning and iterative self-training.

Abstract: To address the dual challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model-Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high-fidelity future states, enabling an “imagination-based” environmental simulation. Within this framework, a base forecasting model acts as an agent, guided by a beam search-based planning algorithm that leverages non-differentiable domain metrics as reward signals to explore high-return future sequences. These identified high-reward candidates then serve as pseudo-labels to continuously optimize the agent’s policy through iterative self-training, significantly reducing prediction error and demonstrating exceptional performance on critical domain metrics like capturing extreme events.

[830] Incorporating Multivariate Consistency in ML-Based Weather Forecasting with Latent-space Constraints

Hang Fan, Yi Xiao, Yongquan Qu, Fenghua Ling, Ben Fei, Lei Bai, Pierre Gentine

Main category: cs.LG

TL;DR: This paper proposes a novel training approach for ML-based weather forecasting by treating model training as a weak-constraint 4D variational data assimilation problem, using latent-space constraints to improve long-term forecast skill and physical realism.

DetailsMotivation: Current ML-based weather forecast models treat reanalysis as perfect truth and use variable-specific loss weighting, ignoring physical coupling and spatial structure, leading to blurry and unrealistic forecasts over long time horizons.

Method: Reinterpret model training as WC-4DVar problem, compute loss in latent space learned by autoencoder where reanalysis error covariance becomes approximately diagonal, and extend framework to handle heterogeneous data sources.

Result: Rollout training with latent-space constraints improves long-term forecast skill, better preserves fine-scale structures and physical realism compared to model-space loss training.

Conclusion: The proposed framework enables more physically realistic ML weather forecasting by incorporating multivariate dependencies and can be extended to train models jointly on reanalysis and multi-source observations.

Abstract: Data-driven machine learning (ML) models have recently shown promise in surpassing traditional physics-based approaches for weather forecasting, leading to a so-called second revolution in weather forecasting. However, most ML-based forecast models treat reanalysis as the truth and are trained under variable-specific loss weighting, ignoring their physical coupling and spatial structure. Over long time horizons, the forecasts become blurry and physically unrealistic under rollout training. To address this, we reinterpret model training as a weak-constraint four-dimensional variational data assimilation (WC-4DVar) problem, treating reanalysis data as imperfect observations. This allows the loss function to incorporate reanalysis error covariance and capture multivariate dependencies. In practice, we compute the loss in a latent space learned by an autoencoder (AE), where the reanalysis error covariance becomes approximately diagonal, thus avoiding the need to explicitly model it in the high-dimensional model space. We show that rollout training with latent-space constraints improves long-term forecast skill and better preserves fine-scale structures and physical realism compared to training with model-space loss. Finally, we extend this framework to accommodate heterogeneous data sources, enabling the forecast model to be trained jointly on reanalysis and multi-source observations within a unified theoretical formulation.

[831] The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View

Xinhao Yao, Lu Yu, Xiaolin Hu, Fengwei Teng, Qing Cui, Jun Zhou, Yong Liu

Main category: cs.LG

TL;DR: RLVR training has two phases: exploitation (improves efficiency but shrinks capability boundaries) and exploration (expands boundaries through novel strategies). Both perspectives on RLVR effects are valid depending on training stage.

DetailsMotivation: To reconcile contradictory findings about whether RLVR expands or shrinks LLM reasoning capabilities by examining the underlying training dynamics.

Method: Theoretical and empirical analysis of two-stage probability mass dynamics in RLVR training: (1) Exploitation stage focusing on high-reward tokens, (2) Exploration stage where optimal tokens emerge.

Result: Both capability boundary shrinkage (during exploitation) and expansion (during exploration) occur. Over-exploitation causes shrinkage, while prolonged training enables expansion through optimal token discovery.

Conclusion: RLVR’s effects depend on training phase. Using relative negative gradients can prolong training to reach exploration stage, enabling advanced reasoning capability development.

Abstract: The ongoing debate on whether reinforcement learning with verifiable rewards (RLVR) expands or shrinks the reasoning capabilities of large language models (LLMs) remains unresolved. Some studies contend that RLVR mainly improves sampling efficiency but at the expense of diversity and exploratory capacity, resulting in capability boundary shrinkage. In contrast, others demonstrate that prolonged training can lead to the emergence of novel reasoning strategies, suggesting capability boundary expansion. To reconcile these contradictory findings, we theoretically and empirically show that both perspectives are partially valid-each aligning with a separate phase in an inherent two-stage probability mass dynamic: (1) Exploitation stage: initially, the model primarily samples explored high-reward and low-reward tokens, while rarely selecting the potentially optimal token. Positive advantage estimates increase the probability of high-reward tokens and decrease those of low-reward tokens, yet the optimal token’s probability remains largely unchanged during this stage. (2) Exploration stage: as training advances, the growth rate of previously acquired high-reward tokens slows as their probabilities approach saturation. When a potentially optimal token-now receiving positive advantage estimates-is occasionally sampled, its probability increases, while those of the originally high-reward tokens decrease. This dynamic suggests that over-exploitation during the exploitation stage may lead to capability boundary shrinkage, whereas prolonged training into the exploration stage can promote an expansion of the reasoning capability boundary. Building upon our insights, we revisit the potential of only using relative negative gradients for prolonging training, providing a theoretical and empirical foundation for the development of more advanced reasoning capabilities.

[832] Multi-Class Support Vector Machine with Differential Privacy

Jinseong Park, Yujin Choi, Jaewook Lee

Main category: cs.LG

TL;DR: Proposes a differentially private multi-class SVM (PMSVM) that avoids privacy budget waste by using all-in-one approaches instead of traditional one-vs-rest/one methods, with rigorous privacy guarantees.

DetailsMotivation: Standard DP approaches for multi-class SVMs waste privacy budget by repeatedly querying each data sample when building multiple binary classifiers in one-vs-rest/one methods.

Method: Developed PMSVM with weight and gradient perturbation methods, providing sensitivity and convergence analyses to ensure differential privacy in all-in-one SVMs that access each data sample only once.

Result: Empirical results show the approach surpasses existing DP-SVM methods in multi-class scenarios.

Conclusion: The proposed PMSVM provides an effective differentially private solution for multi-class SVM classification with better privacy budget utilization and performance.

Abstract: With the increasing need to safeguard data privacy in machine learning models, differential privacy (DP) is one of the major frameworks to build privacy-preserving models. Support Vector Machines (SVMs) are widely used traditional machine learning models due to their robust margin guarantees and strong empirical performance in binary classification. However, applying DP to multi-class SVMs is inadequate, as the standard one-versus-rest (OvR) and one-versus-one (OvO) approaches repeatedly query each data sample when building multiple binary classifiers, thus consuming the privacy budget proportionally to the number of classes. To overcome this limitation, we explore all-in-one SVM approaches for DP, which access each data sample only once to construct multi-class SVM boundaries with margin maximization properties. We propose a novel differentially Private Multi-class SVM (PMSVM) with weight and gradient perturbation methods, providing rigorous sensitivity and convergence analyses to ensure DP in all-in-one SVMs. Empirical results demonstrate that our approach surpasses existing DP-SVM methods in multi-class scenarios.

[833] Adaptive kernel-density approach for imbalanced binary classification

Kotaro J. Nishimura, Yuichi Sakumura, Kazushi Ikeda

Main category: cs.LG

TL;DR: KOTARO addresses severe class imbalance in binary classification by adaptively adjusting decision boundaries using kernel density estimation with dynamic bandwidth tuning.

DetailsMotivation: Class imbalance causes biased predictions toward majority classes, which is critical in domains like medical diagnosis and anomaly detection where minority class recognition is essential. Conventional methods fail under severe imbalance.

Method: Proposed KOTARO extends kernel density estimation by dynamically tuning Gaussian basis function bandwidth based on local sample density, adaptively adjusting decision boundaries to better capture minority regions.

Result: Experiments on synthetic and real-world imbalanced datasets showed KOTARO outperforms conventional methods, especially under severe imbalance conditions.

Conclusion: KOTARO demonstrates strong potential as a promising solution for a wide range of imbalanced classification problems, particularly when dealing with severe class imbalance.

Abstract: Class imbalance is a common challenge in real-world binary classification tasks, often leading to predictions biased toward the majority class and reduced recognition of the minority class. This issue is particularly critical in domains such as medical diagnosis and anomaly detection, where correct classification of minority classes is essential. Conventional methods often fail to deliver satisfactory performance when the imbalance ratio is extremely severe. To address this challenge, we propose a novel approach called Kernel-density-Oriented Threshold Adjustment with Regional Optimization (KOTARO), which extends the framework of kernel density estimation (KDE) by adaptively adjusting decision boundaries according to local sample density. In KOTARO, the bandwidth of Gaussian basis functions is dynamically tuned based on the estimated density around each sample, thereby enhancing the classifier’s ability to capture minority regions. We validated the effectiveness of KOTARO through experiments on both synthetic and real-world imbalanced datasets. The results demonstrated that KOTARO outperformed conventional methods, particularly under conditions of severe imbalance, highlighting its potential as a promising solution for a wide range of imbalanced classification problems

[834] Offline Reinforcement Learning in Large State Spaces: Algorithms and Guarantees

Nan Jiang, Tengyang Xie

Main category: cs.LG

TL;DR: Introduction to offline reinforcement learning theory in large state spaces, covering key concepts like expressivity assumptions and data coverage, with various algorithmic approaches and complexity guarantees.

DetailsMotivation: To establish theoretical foundations for offline reinforcement learning where policies are learned from historical data without online environment interactions, addressing the challenges of large state spaces.

Method: Introduces key theoretical concepts including expressivity assumptions (Bellman completeness vs. realizability) and data coverage conditions (all-policy vs. single-policy coverage), and describes a landscape of algorithms based on different assumption combinations.

Result: Presents a comprehensive theoretical framework with various algorithmic approaches that provide different sample and computational complexity guarantees depending on the chosen assumptions.

Conclusion: Establishes foundational theory for offline RL in large state spaces, identifies key trade-offs between assumptions and complexity guarantees, and discusses open questions and connections to adjacent research areas.

Abstract: This article introduces the theory of offline reinforcement learning in large state spaces, where good policies are learned from historical data without online interactions with the environment. Key concepts introduced include expressivity assumptions on function approximation (e.g., Bellman completeness vs. realizability) and data coverage (e.g., all-policy vs. single-policy coverage). A rich landscape of algorithms and results is described, depending on the assumptions one is willing to make and the sample and computational complexity guarantees one wishes to achieve. We also discuss open questions and connections to adjacent areas.

[835] Variational Diffusion Unlearning: A Variational Inference Framework for Unlearning in Diffusion Models under Data Constraints

Subhodip Panda, MS Varun, Shreyans Jain, Sarthak Kumar Maharana, Prathosh A. P

Main category: cs.LG

TL;DR: VDU is a machine unlearning method for diffusion models that prevents generation of undesired outputs using only a subset of undesired training data, combining plasticity induction and stability regularization.

DetailsMotivation: To safely deploy diffusion models by regulating generated outputs and preventing undesired, violent, or obscene content, especially in data-constrained settings where full training data is inaccessible.

Method: Variational Diffusion Unlearning (VDU) uses variational inference with a loss function containing plasticity inducer (reduces log-likelihood of undesired data) and stability regularizer (preserves generation quality by regularizing parameters).

Result: Effective class unlearning on MNIST, CIFAR-10, and tinyImageNet datasets from DDPM models, and feature unlearning on Stable Diffusion models.

Conclusion: VDU provides a computationally efficient solution for machine unlearning in diffusion models under data-constrained settings, successfully preventing generation of undesired content while maintaining image quality.

Abstract: For a responsible and safe deployment of diffusion models in various domains, regulating the generated outputs from these models is desirable because such models could generate undesired, violent, and obscene outputs. To tackle this problem, recent works use machine unlearning methodology to forget training data points containing these undesired features from pre-trained generative models. However, these methods proved to be ineffective in data-constrained settings where the whole training dataset is inaccessible. Thus, the principal objective of this work is to propose a machine unlearning methodology that can prevent the generation of outputs containing undesired features from a pre-trained diffusion model in such a data-constrained setting. Our proposed method, termed as Variational Diffusion Unlearning (VDU), is a computationally efficient method that only requires access to a subset of training data containing undesired features. Our approach is inspired by the variational inference framework with the objective of minimizing a loss function consisting of two terms: plasticity inducer and stability regularizer. Plasticity inducer reduces the log-likelihood of the undesired training data points, while the stability regularizer, essential for preventing loss of image generation quality, regularizes the model in parameter space. We validate the effectiveness of our method through comprehensive experiments for both class unlearning and feature unlearning. For class unlearning, we unlearn some user-identified classes from MNIST, CIFAR-10, and tinyImageNet datasets from a pre-trained unconditional denoising diffusion probabilistic model (DDPM). Similarly, for feature unlearning, we unlearn the generation of certain high-level features from a pre-trained Stable Diffusion model

[836] Using predefined vector systems as latent space configuration for neural network supervised training on data with arbitrarily large number of classes

Nikita Gabdullin

Main category: cs.LG

TL;DR: Proposes a methodology to train neural networks with fixed architecture regardless of class count by using predefined vector systems as target latent space configurations, enabling training on datasets with extremely large number of classes.

DetailsMotivation: Supervised learning methods require NN parameters dependent on class count, limiting applicability when number of classes is extremely large or unknown in advance.

Method: Use predefined vector systems as target latent space configuration during training, specifically randomly perturbed vectors of An root system, matching NN predictions with these predefined vectors.

Result: Successfully trained encoders and ViT on Cinic-10 and ImageNet-1K in low- and high-dimensional cases, and ViT on dataset with 1.28 million classes.

Conclusion: Method enables training on datasets with extremely large class counts and has potential applications in lifelong learning and NN distillation.

Abstract: Supervised learning (SL) methods are indispensable for neural network (NN) training used to perform classification tasks. While resulting in very high accuracy, SL training often requires making NN parameter number dependent on the number of classes, limiting their applicability when the number of classes is extremely large or unknown in advance. In this paper we propose a methodology that allows one to train the same NN architecture regardless of the number of classes. This is achieved by using predefined vector systems as the target latent space configuration (LSC) during NN training. We discuss the desired properties of target configurations and choose randomly perturbed vectors of An root system for our experiments. These vectors are used to successfully train encoders and visual transformers (ViT) on Cinic-10 and ImageNet-1K in low- and high-dimensional cases by matching NN predictions with the predefined vectors. Finally, ViT is trained on a dataset with 1.28 million classes illustrating the applicability of the method to training on datasets with extremely large number of classes. In addition, potential applications of LSC in lifelong learning and NN distillation are discussed illustrating versatility of the proposed methodology.

[837] Rethinking Consistent Multi-Label Classification under Inexact Supervision

Wei Wang, Tianhao Ma, Ming-Kun Xie, Gang Niu, Masashi Sugiyama

Main category: cs.LG

TL;DR: Proposes consistent approaches for partial and complementary multi-label learning that don’t require accurate estimation of label generation processes or uniform distribution assumptions, using unbiased risk estimators with theoretical guarantees.

DetailsMotivation: To address limitations in existing approaches for partial multi-label learning and complementary multi-label learning, which require accurate estimation of label generation processes or assume uniform distributions - conditions difficult to satisfy in real-world scenarios.

Method: Proposes two unbiased risk estimators based on first- and second-order strategies that handle both partial multi-label learning and complementary multi-label learning in a unified way without relying on the problematic assumptions.

Result: Theoretically proves consistency with respect to multi-label classification evaluation metrics and derives convergence rates. Empirically shows effectiveness against state-of-the-art methods through extensive experiments.

Conclusion: The proposed approaches provide consistent solutions for both partial and complementary multi-label learning problems without requiring the restrictive assumptions of existing methods, with strong theoretical guarantees and empirical performance.

Abstract: Partial multi-label learning and complementary multi-label learning are two popular weakly supervised multi-label classification paradigms that aim to alleviate the high annotation costs of collecting precisely annotated multi-label data. In partial multi-label learning, each instance is annotated with a candidate label set, among which only some labels are relevant; in complementary multi-label learning, each instance is annotated with complementary labels indicating the classes to which the instance does not belong. Existing consistent approaches for the two paradigms either require accurate estimation of the generation process of candidate or complementary labels or assume a uniform distribution to eliminate the estimation problem. However, both conditions are usually difficult to satisfy in real-world scenarios. In this paper, we propose consistent approaches that do not rely on the aforementioned conditions to handle both problems in a unified way. Specifically, we propose two unbiased risk estimators based on first- and second-order strategies. Theoretically, we prove consistency w.r.t. two widely used multi-label classification evaluation metrics and derive convergence rates for the estimation errors of the proposed risk estimators. Empirically, extensive experimental results validate the effectiveness of our proposed approaches against state-of-the-art methods.

[838] Why Cannot Neural Networks Master Extrapolation? Insights from Physical Laws

Ramzi Dakhmouche, Hossein Gorji

Main category: cs.LG

TL;DR: The paper identifies a fundamental property that explains why deep learning models struggle with extrapolation in time series forecasting, contrasting with physical laws that have strong extrapolation capabilities.

DetailsMotivation: Foundation Models excel in short-range forecasting but fail at long-range extrapolation, performing worse than simple baselines, which contrasts with physical laws' strong extrapolation properties.

Method: The authors identify and formalize a fundamental property characterizing statistical learning models’ ability to predict outside their training domain, supported by theoretical analysis and empirical results on current deep learning architectures.

Result: The research clarifies the root causes of the extrapolation gap in deep learning models and demonstrates performance deterioration in extrapolation settings through empirical evidence.

Conclusion: The findings suggest directions for designing next-generation forecasting models capable of mastering extrapolation by addressing the fundamental differences between neural network structure and physical laws.

Abstract: Motivated by the remarkable success of Foundation Models (FMs) in language modeling, there has been growing interest in developing FMs for time series prediction, given the transformative power such models hold for science and engineering. This culminated in significant success of FMs in short-range forecasting settings. However, extrapolation or long-range forecasting remains elusive for FMs, which struggle to outperform even simple baselines. This contrasts with physical laws which have strong extrapolation properties, and raises the question of the fundamental difference between the structure of neural networks and physical laws. In this work, we identify and formalize a fundamental property characterizing the ability of statistical learning models to predict more accurately outside of their training domain, hence explaining performance deterioration for deep learning models in extrapolation settings. In addition to a theoretical analysis, we present empirical results showcasing the implications of this property on current deep learning architectures. Our results not only clarify the root causes of the extrapolation gap but also suggest directions for designing next-generation forecasting models capable of mastering extrapolation.

[839] Attending on Multilevel Structure of Proteins enables Accurate Prediction of Cold-Start Drug-Target Interactions

Ziying Zhang, Yaqing Wang, Yuxuan Sun, Min Ye, Quanming Yao

Main category: cs.LG

TL;DR: ColdDTI is a framework for cold-start drug-target interaction prediction that uses hierarchical attention to model multi-level protein structures (primary to quaternary) and their interactions with drug structures at different granularities.

DetailsMotivation: Existing methods only use primary protein structures, but proteomics shows that multi-level protein structures all influence drug-target interactions. This limitation prevents capturing interactions involving higher-level structures.

Method: Uses hierarchical attention mechanism to mine interactions between multi-level protein structures (primary to quaternary) and drug structures at both local and global granularities, then fuses structure representations for prediction.

Result: Experiments on benchmark datasets show ColdDTI consistently outperforms previous methods in cold-start settings.

Conclusion: The framework captures biologically transferable priors and avoids overfitting from excessive reliance on representation learning, demonstrating the importance of modeling multi-level protein structures for cold-start DTI prediction.

Abstract: Cold-start drug-target interaction (DTI) prediction focuses on interaction between novel drugs and proteins. Previous methods typically learn transferable interaction patterns between structures of drug and proteins to tackle it. However, insight from proteomics suggest that protein have multi-level structures and they all influence the DTI. Existing works usually represent protein with only primary structures, limiting their ability to capture interactions involving higher-level structures. Inspired by this insight, we propose ColdDTI, a framework attending on protein multi-level structure for cold-start DTI prediction. We employ hierarchical attention mechanism to mine interaction between multi-level protein structures (from primary to quaternary) and drug structures at both local and global granularities. Then, we leverage mined interactions to fuse structure representations of different levels for final prediction. Our design captures biologically transferable priors, avoiding the risk of overfitting caused by excessive reliance on representation learning. Experiments on benchmark datasets demonstrate that ColdDTI consistently outperforms previous methods in cold-start settings.

[840] Can Linear Probes Measure LLM Uncertainty?

Ramzi Dakhmouche, Adrien Letellier, Hossein Gorji

Main category: cs.LG

TL;DR: The paper proposes a Bayesian approach for uncertainty quantification in LLMs using linear regression models between layers, outperforming current methods.

DetailsMotivation: Current uncertainty quantification methods for LLMs with multiple choice structure are dominated by naive baselines like maximum softmax score, which is insufficient for reliable deployment.

Method: Train multiple Bayesian linear models to predict each layer’s output from the previous layer, then infer global uncertainty by identifying sparse combinations of distributional features from the layer-level posterior distributions.

Result: Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines.

Conclusion: A principled Bayesian approach with simple linear models provides effective uncertainty quantification for LLMs, addressing the shortcomings of current methods.

Abstract: Effective Uncertainty Quantification (UQ) represents a key aspect for reliable deployment of Large Language Models (LLMs) in automated decision-making and beyond. Yet, for LLM generation with multiple choice structure, the state-of-the-art in UQ is still dominated by the naive baseline given by the maximum softmax score. To address this shortcoming, we demonstrate that taking a principled approach via Bayesian statistics leads to improved performance despite leveraging the simplest possible model, namely linear regression. More precisely, we propose to train multiple Bayesian linear models, each predicting the output of a layer given the output of the previous one. Based on the obtained layer-level posterior distributions, we infer the global uncertainty level of the LLM by identifying a sparse combination of distributional features, leading to an efficient UQ scheme. Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines.

[841] Wasserstein projection distance for fairness testing of regression models

Wanxin Li, Yongjin P. Park, Khanh Dao Duc

Main category: cs.LG

TL;DR: A Wasserstein projection-based framework for fairness testing in regression models, addressing expectation-based fairness criteria through hypothesis testing and optimal data perturbation.

DetailsMotivation: Most fairness research focuses on classification tasks, leaving regression models underexplored despite their critical importance in real-world applications.

Method: Proposes a hypothesis-testing approach and optimal data perturbation method using Wasserstein projection framework, with theoretical analysis including dual reformulation, asymptotic bounds, and limiting distributions.

Result: Experiments show higher specificity than permutation-based tests and effective bias detection/mitigation in student performance and housing price prediction applications.

Conclusion: The proposed framework successfully addresses fairness in regression models, offering improved testing specificity and practical bias mitigation capabilities.

Abstract: Fairness in machine learning is a critical concern, yet most research has focused on classification tasks, leaving regression models underexplored. This paper introduces a Wasserstein projection-based framework for fairness testing in regression models, focusing on expectation-based criteria. We propose a hypothesis-testing approach and an optimal data perturbation method to improve fairness while balancing accuracy. Theoretical results include a detailed categorization of fairness criteria for regression, a dual reformulation of the Wasserstein projection test statistic, and the derivation of asymptotic bounds and limiting distributions. Experiments on synthetic and real-world datasets demonstrate that the proposed method offers higher specificity compared to permutation-based tests, and effectively detects and mitigates biases in real applications such as student performance and housing price prediction.

[842] On the Limitations and Capabilities of Position Embeddings for Length Generalization

Yang Chen, Yitao Liang, Zhouchen Lin

Main category: cs.LG

TL;DR: Position Embeddings (PEs) structure computations across positions rather than expand capabilities. Length Generalization (LG) depends on Sequential Representation Complexity (SRC) remaining invariant across scales. The paper introduces Scale Hint and Learning-Based Position Embedding to improve LG.

DetailsMotivation: To understand the fundamental role of Position Embeddings in Transformers for Length Generalization, as their limitations and capabilities remain unclear despite significant influence on LG performance.

Method: Theoretical analysis of PEs in Position-Only Linear Attentions (POLAs) using Linear Representation Complexity (LRC), extension to practical Transformers with Sequential Representation Complexity (SRC), empirical validation, and introduction of Scale Hint and Learning-Based Position Embedding framework.

Result: Analysis shows PEs structure learned computations across positions rather than expand computational capabilities. LG is possible when SRC remains invariant across scales, supported by empirical evidence. Proposed methods enhance LG performance.

Conclusion: The work provides theoretical insights into PEs’ role in LG and practical strategies (Scale Hint and Learning-Based Position Embedding) to improve length generalization in Transformers.

Abstract: In Transformers, Position Embeddings (PEs) significantly influence Length Generalization (LG) performance, yet their fundamental role remains unclear. In this work, we investigate the limitations and capabilities of PEs in achieving LG. We theoretically analyze PEs in Position-Only Linear Attentions (POLAs), introducing Linear Representation Complexity (LRC) to characterize when PEs enable LG. Our analysis shows that PEs do not expand computational capabilities but structure learned computations across positions. Extending to practical Transformers, we propose Sequential Representation Complexity (SRC) and conjecture that LG is possible if and only if SRC remains invariant across scales. We support this hypothesis with empirical evidence in various reasoning tasks. To enhance LG, we introduce Scale Hint, allowing flexible instance scaling, and a Learning-Based Position Embedding framework that automatically learns positional relations. Our work provides theoretical insights and practical strategies for improving LG in Transformers.

[843] On the Statistical Query Complexity of Learning Semiautomata: a Random Walk Approach

George Giapitzakis, Kimon Fountoulakis, Eshaan Nichani, Jason D. Lee

Main category: cs.LG

TL;DR: Statistical Query hardness for semiautomata is established under uniform distribution, showing polynomial alphabet/input length suffices for hardness based on state-transition structure rather than language recognition.

DetailsMotivation: Semiautomata have applications in NLP, robotics, biology, and data mining, but their computational hardness properties weren't well understood under Statistical Query framework.

Method: Reduced distinguishing semiautomata to analyzing random walks on S_N × S_N group, applied Fourier analysis and symmetric group representation theory to obtain spectral gap bounds.

Result: After polynomial steps in number of states, distinct semiautomata become nearly uncorrelated, establishing Statistical Query hardness.

Conclusion: Semiautomata exhibit inherent Statistical Query hardness from their state-transition structure, not just from language complexity like DFA parity problems.

Abstract: Semiautomata form a rich class of sequence-processing algorithms with applications in natural language processing, robotics, computational biology, and data mining. We establish the first Statistical Query hardness result for semiautomata under the uniform distribution over input words and initial states. We show that Statistical Query hardness can be established when both the alphabet size and input length are polynomial in the number of states. Unlike the case of deterministic finite automata, where hardness typically arises through the hardness of the language they recognize (e.g., parity), our result is derived solely from the internal state-transition structure of semiautomata. Our analysis reduces the task of distinguishing the final states of two semiautomata to studying the behavior of a random walk on the group $S_{N} \times S_{N}$. By applying tools from Fourier analysis and the representation theory of the symmetric group, we obtain tight spectral gap bounds, demonstrating that after a polynomial number of steps in the number of states, distinct semiautomata become nearly uncorrelated, yielding the desired hardness result.

[844] PhaseFormer: From Patches to Phases for Efficient and Effective Time Series Forecasting

Yiming Niu, Jinliang Deng, Yongxin Tong

Main category: cs.LG

TL;DR: PhaseFormer introduces an efficient time series forecasting method using phase embeddings and lightweight routing, achieving state-of-the-art performance with only ~1k parameters.

DetailsMotivation: Current deep learning methods using patch-level processing are inefficient due to large parameter counts and high computational costs, despite their effectiveness in exploiting periodicity.

Method: Proposes PhaseFormer with phase-wise prediction using compact phase embeddings and efficient cross-phase interaction via lightweight routing mechanism.

Result: Achieves state-of-the-art performance with only around 1k parameters across benchmark datasets, excelling particularly on large-scale and complex datasets.

Conclusion: PhaseFormer represents a significant advancement toward truly efficient and effective time series forecasting by addressing the inefficiency of patch-level processing.

Abstract: Periodicity is a fundamental characteristic of time series data and has long played a central role in forecasting. Recent deep learning methods strengthen the exploitation of periodicity by treating patches as basic tokens, thereby improving predictive effectiveness. However, their efficiency remains a bottleneck due to large parameter counts and heavy computational costs. This paper provides, for the first time, a clear explanation of why patch-level processing is inherently inefficient, supported by strong evidence from real-world data. To address these limitations, we introduce a phase perspective for modeling periodicity and present an efficient yet effective solution, PhaseFormer. PhaseFormer features phase-wise prediction through compact phase embeddings and efficient cross-phase interaction enabled by a lightweight routing mechanism. Extensive experiments demonstrate that PhaseFormer achieves state-of-the-art performance with around 1k parameters, consistently across benchmark datasets. Notably, it excels on large-scale and complex datasets, where models with comparable efficiency often struggle. This work marks a significant step toward truly efficient and effective time series forecasting. Code is available at this repository: https://github.com/neumyor/PhaseFormer_TSL

[845] Modeling Time Series Dynamics with Fourier Ordinary Differential Equations

Muhao Guo, Yang Weng

Main category: cs.LG

TL;DR: FODEs (Fourier Ordinary Differential Equations) address limitations of Neural ODEs by modeling dynamics in the Fourier domain, using FFT to capture global patterns and periodic behaviors, with learnable filtering to align continuous outputs with discrete observations.

DetailsMotivation: Neural ODEs struggle with capturing long-term dependencies and periodic structures due to time-domain limitations, and face granularity loss from the mismatch between continuous-time formulation and discrete real-world data.

Method: Transform time-series data to frequency domain using FFT, model dynamics in Fourier domain, and introduce learnable element-wise filtering to align continuous model outputs with discrete observations.

Result: Experiments show FODEs outperform existing methods in accuracy and efficiency across various time series datasets, effectively capturing both long- and short-term patterns.

Conclusion: FODEs provide a robust framework for time series modeling by leveraging Fourier domain representations to overcome limitations of traditional Neural ODEs.

Abstract: Neural ODEs (NODEs) have emerged as powerful tools for modeling time series data, offering the flexibility to adapt to varying input scales and capture complex dynamics. However, they face significant challenges: first, their reliance on time-domain representations often limits their ability to capture long-term dependencies and periodic structures; second, the inherent mismatch between their continuous-time formulation and the discrete nature of real-world data can lead to loss of granularity and predictive accuracy. To address these limitations, we propose Fourier Ordinary Differential Equations (FODEs), an approach that embeds the dynamics in the Fourier domain. By transforming time-series data into the frequency domain using the Fast Fourier Transform (FFT), FODEs uncover global patterns and periodic behaviors that remain elusive in the time domain. Additionally, we introduce a learnable element-wise filtering mechanism that aligns continuous model outputs with discrete observations, preserving granularity and enhancing accuracy. Experiments on various time series datasets demonstrate that FODEs outperform existing methods in terms of both accuracy and efficiency. By effectively capturing both long- and short-term patterns, FODEs provide a robust framework for modeling time series dynamics.

[846] Efficient Manifold-Constrained Neural ODE for High-Dimensional Datasets

Muhao Guo, Haoran Li, Yang Weng

Main category: cs.LG

TL;DR: Proposes a novel approach to improve Neural ODEs for high-dimensional data by discovering and leveraging the underlying manifold structure, achieving better computational efficiency and accuracy.

DetailsMotivation: Neural ODEs struggle with high-dimensional systems due to extensive calculations and high truncation errors. Existing methods require prior knowledge of the manifold structure, which is often unavailable in real scenarios.

Method: Uses a structure-preserved encoder to discover the underlying manifold as a graph approximation, then combines this with NODE learning to restrict the ODE process on the manifold.

Result: Demonstrates superior performance across multiple datasets with improved accuracy, reduced number of function evaluations (NFEs), and faster convergence compared to existing baselines.

Conclusion: The approach effectively addresses high-dimensional challenges in NODEs by automatically discovering and leveraging manifold structure, resulting in significant computational and accuracy gains.

Abstract: Neural ordinary differential equations (NODE) have garnered significant attention for their design of continuous-depth neural networks and the ability to learn data/feature dynamics. However, for high-dimensional systems, estimating dynamics requires extensive calculations and suffers from high truncation errors for the ODE solvers. To address the issue, one intuitive approach is to consider the non-trivial topological space of the data distribution, i.e., a low-dimensional manifold. Existing methods often rely on knowledge of the manifold for projection or implicit transformation, restricting the ODE solutions on the manifold. Nevertheless, such knowledge is usually unknown in realistic scenarios. Therefore, we propose a novel approach to explore the underlying manifold to restrict the ODE process. Specifically, we employ a structure-preserved encoder to process data and find the underlying graph to approximate the manifold. Moreover, we propose novel methods to combine the NODE learning with the manifold, resulting in significant gains in computational speed and accuracy. Our experimental evaluations encompass multiple datasets, where we compare the accuracy, number of function evaluations (NFEs), and convergence speed of our model against existing baselines. Our results demonstrate superior performance, underscoring the effectiveness of our approach in addressing the challenges of high-dimensional datasets.

[847] Finite Time Analysis of Constrained Natural Critic-Actor Algorithm with Improved Sample Complexity

Prashansa Panda, Shalabh Bhatnagar

Main category: cs.LG

TL;DR: First natural critic-actor algorithm with function approximation for long-run average cost setting under inequality constraints, with non-asymptotic convergence guarantees and competitive performance on Safety-Gym environments.

DetailsMotivation: Previous studies focused on discounted cost settings and either provided only asymptotic convergence or used tabular representations. There was a gap for long-run average cost settings with function approximation and inequality constraints.

Method: Proposed a natural critic-actor algorithm with function approximation for long-run average cost setting under inequality constraints, with optimal learning rates and a modification to improve sample complexity.

Result: Established non-asymptotic convergence guarantees with optimal learning rates. Experimental results on three Safety-Gym environments showed competitive performance compared to other well-known algorithms.

Conclusion: The proposed critic-actor algorithm successfully addresses the long-run average cost setting with inequality constraints, providing theoretical guarantees and practical competitive performance.

Abstract: Recent studies have increasingly focused on non-asymptotic convergence analyses for actor-critic (AC) algorithms. One such effort introduced a two-timescale critic-actor algorithm for the discounted cost setting using a tabular representation, where the usual roles of the actor and critic are reversed. However, only asymptotic convergence was established there. Subsequently, both asymptotic and non-asymptotic analyses of the critic-actor algorithm with linear function approximation were conducted. In our work, we introduce the first natural critic-actor algorithm with function approximation for the long-run average cost setting and under inequality constraints. We provide the non-asymptotic convergence guarantees for this algorithm. Our analysis establishes optimal learning rates and we also propose a modification to enhance sample complexity. We further show the results of experiments on three different Safety-Gym environments where our algorithm is found to be competitive in comparison with other well known algorithms.

[848] Spectral Alignment as Predictor of Loss Explosion in Neural Network Training

Haiquan Qiu, You Wu, Yingjie Tan, Yaqing Wang, Quanming Yao

Main category: cs.LG

TL;DR: Spectral Alignment (SA) is a novel metric that monitors alignment between layer inputs and weight matrix singular vectors, providing early warning of loss explosions in deep neural network training.

DetailsMotivation: Loss explosions can nullify expensive training runs, and conventional metrics like weight/gradient norms are lagging and ambiguous predictors that vary across models and layers.

Method: Introduce Spectral Alignment (SA) metric that monitors distributional alignment between layer inputs and principal singular vectors of weight matrices. Track collapse in sign diversity of this alignment.

Result: SA provides significantly earlier and clearer warning of loss explosions than traditional scalar metrics. Demonstrated effectiveness on language models with low computational overhead.

Conclusion: SA is a practical, theoretically-grounded tool for safeguarding model training by detecting impending failure through monitoring representational collapse.

Abstract: Loss explosions in training deep neural networks can nullify multi-million dollar training runs. Conventional monitoring metrics like weight and gradient norms are often lagging and ambiguous predictors, as their values vary dramatically across different models and even between layers of the same model, making it difficult to establish a unified standard for detecting impending failure. We introduce Spectral Alignment (SA), a novel, theoretically-grounded metric that monitors the distributional alignment between layer inputs and the principal singular vectors of weight matrices. We show that a collapse in the sign diversity of this alignment is a powerful early predictor of representational collapse and training divergence. Empirical results on language models demonstrate that monitoring the SA distribution provides a significantly earlier and clearer warning of loss explosions than traditional scalar metrics. SA’s low computational overhead makes it a practical tool for safeguarding model training.

[849] PolyKAN: A Polyhedral Analysis Framework for Provable and Minimal KAN Compression

Di Zhang

Main category: cs.LG

TL;DR: PolyKAN is a theoretical framework for compressing Kolmogorov-Arnold Networks (KANs) that provides formal guarantees on model size reduction and approximation error through optimal polyhedral region merging.

DetailsMotivation: KANs offer enhanced interpretability over traditional MLPs but suffer from parameter inefficiency, limiting their practical deployment. There is a need for compression methods with mathematical guarantees.

Method: Leverages the piecewise polynomial structure of KANs, formulates compression as optimal polyhedral region merging, establishes polyhedral characterization, develops ε-equivalent compression theory, and designs an optimal dynamic programming algorithm.

Result: PolyKAN achieves provably minimal compression while maintaining strict error control, with polynomial-time complexity in all network parameters.

Conclusion: The framework provides the first formal foundation for KAN compression with mathematical guarantees, enabling efficient deployment of interpretable neural architectures.

Abstract: Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to traditional Multi-Layer Perceptrons (MLPs), offering enhanced interpretability and a strong mathematical foundation. However, their parameter efficiency remains a significant challenge for practical deployment. This paper introduces PolyKAN, a novel theoretical framework for KAN compression that provides formal guarantees on both model size reduction and approximation error. By leveraging the inherent piecewise polynomial structure of KANs, we formulate the compression problem as one of optimal polyhedral region merging. We establish a rigorous polyhedral characterization of KANs, develop a complete theory of $\epsilon$-equivalent compression, and design an optimal dynamic programming algorithm that guarantees minimal compression under specified error bounds. Our theoretical analysis demonstrates that PolyKAN achieves provably minimal compression while maintaining strict error control, with polynomial-time complexity in all network parameters. The framework provides the first formal foundation for KAN compression with mathematical guarantees, opening new directions for efficient deployment of interpretable neural architectures.

[850] Adaptive Federated Learning via Dynamical System Model

Aayushya Agarwal, Larry Pileggi, Gauri Joshi

Main category: cs.LG

TL;DR: An adaptive federated learning method that automatically tunes learning rates and momentum parameters by modeling FL as a dynamical system, eliminating manual hyperparameter tuning for heterogeneous settings.

DetailsMotivation: Manual hyperparameter tuning is expensive and challenging in heterogeneous federated learning with non-IID data and varying client capabilities, requiring automated solutions.

Method: Models federated learning as a dynamical system, using principles from numerical simulation to adaptively select learning rates and momentum parameters for both clients and central servers.

Result: Achieves fast convergence while being insensitive to global hyperparameter choice, outperforming state-of-the-art adaptive methods in heterogeneous settings.

Conclusion: The framework provides a fully integrated solution that handles objective inconsistency and client drift, enabling rapid prototyping and scalable deployment without manual hyperparameter tuning.

Abstract: Hyperparameter selection is critical for stable and efficient convergence of heterogeneous federated learning, where clients differ in computational capabilities, and data distributions are non-IID. Tuning hyperparameters is a manual and computationally expensive process as the hyperparameter space grows combinatorially with the number of clients. To address this, we introduce an end-to-end adaptive federated learning method in which both clients and central agents adaptively select their local learning rates and momentum parameters. Our approach models federated learning as a dynamical system, allowing us to draw on principles from numerical simulation and physical design. Through this perspective, selecting momentum parameters equates to critically damping the system for fast, stable convergence, while learning rates for clients and central servers are adaptively selected to satisfy accuracy properties from numerical simulation. The result is an adaptive, momentum-based federated learning algorithm in which the learning rates for clients and servers are dynamically adjusted and controlled by a single, global hyperparameter. By designing a fully integrated solution for both adaptive client updates and central agent aggregation, our method is capable of handling key challenges of heterogeneous federated learning, including objective inconsistency and client drift. Importantly, our approach achieves fast convergence while being insensitive to the choice of the global hyperparameter, making it well-suited for rapid prototyping and scalable deployment. Compared to state-of-the-art adaptive methods, our framework is shown to deliver superior convergence for heterogeneous federated learning while eliminating the need for hyperparameter tuning both client and server updates.

[851] Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Haiquan Qiu, Quanming Yao

Main category: cs.LG

TL;DR: This paper explains why flash attention with low-precision training causes catastrophic loss explosions, identifying two key factors: similar low-rank representations in attention and biased rounding errors that create a vicious cycle of error accumulation.

DetailsMotivation: To understand and solve the persistent training instability problem when using flash attention with low-precision formats, which hinders computational efficiency in transformer training.

Method: Conducted in-depth mechanistic analysis of the failure case, identified two intertwined phenomena (low-rank representations and biased rounding errors), and introduced a minimal modification to flash attention to mitigate rounding bias.

Result: The analysis revealed that the failure is systematic, not random, and the proposed simple modification successfully stabilizes the training process by addressing the biased rounding errors.

Conclusion: The paper provides the first mechanistic explanation for this long-standing failure case and offers a practical solution that confirms the analysis while enabling stable low-precision training with flash attention.

Abstract: The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosions. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem.

[852] MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering

Chenlu Ding, Jiancan Wu, Leheng Sheng, Fan Zhang, Yancheng Yuan, Xiang Wang, Xiangnan He

Main category: cs.LG

TL;DR: MLLMEraser is a training-free framework for test-time unlearning in multimodal large language models that uses activation steering to dynamically erase knowledge without parameter updates, outperforming existing methods with lower computational cost.

DetailsMotivation: Address concerns about memorized private data, outdated knowledge, and harmful content in deployed MLLMs, while overcoming limitations of existing training-based unlearning approaches that are computationally expensive, irreversible, and distort retained knowledge.

Method: Uses activation steering with multimodal erasure directions constructed by contrasting adversarially perturbed knowledge-recall and knowledge-erasure image-text pairs, plus an input-aware steering mechanism that adaptively applies erasure only when needed.

Result: Consistently outperforms state-of-the-art MLLM unlearning baselines on LLaVA-1.5 and Qwen-2.5-VL, achieving stronger forgetting performance with lower computational cost and minimal utility degradation.

Conclusion: MLLMEraser provides an effective training-free solution for test-time unlearning in MLLMs, enabling dynamic knowledge erasure while preserving utility on retained knowledge.

Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across vision-language tasks, yet their large-scale deployment raises pressing concerns about memorized private data, outdated knowledge, and harmful content. Existing unlearning approaches for MLLMs typically adapt training-based strategies such as gradient ascent or preference optimization, but these methods are computationally expensive, irreversible, and often distort retained knowledge. In this work, we propose MLLMEraser, an input-aware, training-free framework for test-time unlearning. Our approach leverages activation steering to enable dynamic knowledge erasure without parameter updates. Specifically, we construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs with knowledge-erasure counterparts, capturing both textual and visual discrepancies. To prevent unnecessary interference, we further design an input-aware steering mechanism that adaptively determines when and how the erasure direction should be applied, preserving utility on retained knowledge while enforcing forgetting on designated content. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines, achieving stronger forgetting performance with lower computational cost and minimal utility degradation.

[853] Physics-Inspired All-Pair Interaction Learning for 3D Dynamics Modeling

Kai Yang, Yuqi Huang, Junheng Tao, Wanyu Wang, Qitian Wu

Main category: cs.LG

TL;DR: PAINET is an SE(3)-equivariant neural architecture that learns all-pair interactions in multi-body systems through physics-inspired attention and parallel decoding, achieving significant error reductions in 3D dynamics prediction.

DetailsMotivation: Existing GNN-based approaches for 3D dynamics modeling depend on explicitly observed structures and fail to capture unobserved interactions that are crucial to complex physical behaviors and dynamics mechanisms.

Method: PAINET comprises: (1) a physics-inspired attention network derived from energy function minimization trajectory, and (2) a parallel decoder that preserves equivariance while enabling efficient inference.

Result: PAINET consistently outperforms recent models on diverse benchmarks (human motion capture, molecular dynamics, protein simulations), yielding 4.7% to 41.5% error reductions in 3D dynamics prediction with comparable computation costs.

Conclusion: PAINET provides an effective SE(3)-equivariant approach for learning all-pair interactions in multi-body systems, significantly improving 3D dynamics prediction across various domains.

Abstract: Modeling 3D dynamics is a fundamental problem in multi-body systems across scientific and engineering domains and has important practical implications in trajectory prediction and simulation. While recent GNN-based approaches have achieved strong performance by enforcing geometric symmetries, encoding high-order features or incorporating neural-ODE mechanics, they typically depend on explicitly observed structures and inherently fail to capture the unobserved interactions that are crucial to complex physical behaviors and dynamics mechanism. In this paper, we propose PAINET, a principled SE(3)-equivariant neural architecture for learning all-pair interactions in multi-body systems. The model comprises: (1) a novel physics-inspired attention network derived from the minimization trajectory of an energy function, and (2) a parallel decoder that preserves equivariance while enabling efficient inference. Empirical results on diverse real-world benchmarks, including human motion capture, molecular dynamics, and large-scale protein simulations, show that PAINET consistently outperforms recently proposed models, yielding 4.7% to 41.5% error reductions in 3D dynamics prediction with comparable computation costs in terms of time and memory.

[854] Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions

Jinhui Bai, Andreas Christmann, Lei Shi

Main category: cs.LG

TL;DR: A novel kernel SGD algorithm with improved efficiency and scalability through adaptive regularization and spectral analysis, achieving minimax-optimal convergence rates and reduced computational complexity.

DetailsMotivation: To address the inefficiency and scalability limitations of traditional kernel SGD methods for large-scale supervised learning with general losses.

Method: Proposes a kernel SGD algorithm using an innovative regularization strategy that projects stochastic gradients onto finite-dimensional hypothesis spaces via spherical radial basis function expansion, with adaptive scaling based on bias-variance trade-off. Uses spectral structure analysis of kernel-induced covariance operators.

Result: Achieves minimax-optimal convergence rates for both last iterate and suffix average, establishes optimal strong convergence in RKHS, accommodates various loss functions (least-squares, Huber, logistic), and significantly reduces computational and storage complexity.

Conclusion: The proposed algorithm provides an efficient and scalable solution for large-scale kernel learning with strong theoretical guarantees and practical performance improvements over traditional kernel SGD methods.

Abstract: In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.

[855] Diffusion-Assisted Distillation for Self-Supervised Graph Representation Learning with MLPs

Seong Jin Ahn, Myoung-Ho Kim

Main category: cs.LG

TL;DR: DAD-SGM is a new distillation method that uses denoising diffusion models as teacher assistants to bridge the capacity gap between GNNs and MLPs in self-supervised graph representation learning, enhancing MLP performance.

DetailsMotivation: There's growing interest in replacing GNNs with lightweight MLPs via knowledge distillation, but distilling self-supervised graph representation learning is challenging due to the importance of inductive bias in self-supervised performance.

Method: Proposes DAD-SGM which employs a denoising diffusion model as a teacher assistant to better distill knowledge from teacher GNNs into student MLPs.

Result: Extensive experiments show DAD-SGM effectively distills knowledge from self-supervised GNNs compared to state-of-the-art GNN-to-MLP distillation methods.

Conclusion: The diffusion-assisted approach enhances the generalizability and robustness of MLPs in self-supervised graph representation learning.

Abstract: For large-scale applications, there is growing interest in replacing Graph Neural Networks (GNNs) with lightweight Multi-Layer Perceptrons (MLPs) via knowledge distillation. However, distilling GNNs for self-supervised graph representation learning into MLPs is more challenging. This is because the performance of self-supervised learning is more related to the model’s inductive bias than supervised learning. This motivates us to design a new distillation method to bridge a huge capacity gap between GNNs and MLPs in self-supervised graph representation learning. In this paper, we propose \textbf{D}iffusion-\textbf{A}ssisted \textbf{D}istillation for \textbf{S}elf-supervised \textbf{G}raph representation learning with \textbf{M}LPs (DAD-SGM). The proposed method employs a denoising diffusion model as a teacher assistant to better distill the knowledge from the teacher GNN into the student MLP. This approach enhances the generalizability and robustness of MLPs in self-supervised graph representation learning. Extensive experiments demonstrate that DAD-SGM effectively distills the knowledge of self-supervised GNNs compared to state-of-the-art GNN-to-MLP distillation methods. Our implementation is available at https://github.com/SeongJinAhn/DAD-SGM.

[856] Efficient Latent Variable Causal Discovery: Combining Score Search and Targeted Testing

Joseph Ramsey, Bryan Andrews

Main category: cs.LG

TL;DR: The paper introduces several improved causal discovery algorithms for handling latent variables and selection bias, including BOSS-FCI, GRaSP-FCI, FCIT, and LV-Dumb, which use score-guided and targeted testing strategies to overcome limitations of traditional FCI.

DetailsMotivation: Traditional FCI algorithm performs exhaustive conditional independence tests that lead to spurious independence claims, unreliable orientations, and scalability issues when latent variables or selection bias are present.

Method: Developed four methods: 1) BOSS-FCI and GRaSP-FCI as variants using different search algorithms, 2) FCIT with targeted testing guided by BOSS instead of exhaustive tests, 3) LV-Dumb heuristic that directly returns PAG from BOSS DAG bypassing latent-variable reasoning.

Result: Simulations and real-data analyses show BOSS-FCI and GRaSP-FCI provide sound baselines, FCIT improves efficiency and reliability, and LV-Dumb achieves superior accuracy in practice despite not being strictly correct.

Conclusion: Score-guided and targeted strategies enable scalable latent-variable causal discovery, with the proposed methods offering improved performance over traditional approaches.

Abstract: Learning causal structure from observational data is especially challenging when latent variables or selection bias are present. The Fast Causal Inference (FCI) algorithm addresses this setting but often performs exhaustive conditional independence tests across many subsets, leading to spurious independence claims, extra or missing edges, and unreliable orientations. We present a family of score-guided mixed-strategy causal search algorithms that build on this tradition. First, we introduce BOSS-FCI and GRaSP-FCI, straightforward variants of GFCI that substitute BOSS or GRaSP for FGES, thereby retaining correctness while incurring different scalability tradeoffs. Second, we develop FCI Targeted-testing (FCIT), a novel mixed-strategy method that improves upon these variants by replacing exhaustive all-subsets testing with targeted tests guided by BOSS, yielding well-formed PAGs with higher precision and efficiency. Finally, we propose a simple heuristic, LV-Dumb (also known as BOSS-POD), which bypasses latent-variable-specific reasoning and directly returns the PAG of the BOSS DAG. Although not strictly correct in the FCI sense, it scales better and often achieves superior accuracy in practice. Simulations and real-data analyses demonstrate that BOSS-FCI and GRaSP-FCI provide sound baselines, FCIT improves both efficiency and reliability, and LV-Dumb offers a practical heuristic with strong empirical performance. Together, these method highlight the value of score-guided and targeted strategies for scalable latent-variable causal discovery.

[857] Influence branching for learning to solve mixed-integer programs online

Paul Strang, Zacharie Alès, Côme Bissuel, Olivier Juan, Safia Kedad-Sidhoum, Emmanuel Rachelson

Main category: cs.LG

TL;DR: A new online learning approach called influence branching is introduced for solving Mixed Integer Programs (MIPs), using Thompson sampling to optimize graph-based variable selection strategies during branch and bound.

DetailsMotivation: To develop an improved online learning method for solving MIPs that can adapt to variations in problem structure and generalize well across different problem instances.

Method: Influence branching - a graph-oriented variable selection strategy applied in early branch and bound iterations, optimized online using Thompson sampling to rank graph representations based on computational speed improvements over SCIP.

Result: Achieved performance comparable to state-of-the-art online learning methods, with good generalization to problems with varying constraint matrices, constraint vectors, and objective coefficients when more samples are available.

Conclusion: The proposed influence branching method with Thompson sampling provides an effective online learning approach for MIP solving that generalizes well across different problem variations.

Abstract: On the occasion of the 20th Mixed Integer Program Workshop’s computational competition, this work introduces a new approach for learning to solve MIPs online. Influence branching, a new graph-oriented variable selection strategy, is applied throughout the first iterations of the branch and bound algorithm. This branching heuristic is optimized online with Thompson sampling, which ranks the best graph representations of MIP’s structure according to computational speed up over SCIP. We achieve results comparable to state of the art online learning methods. Moreover, our results indicate that our method generalizes well to more general online frameworks, where variations in constraint matrix, constraint vector and objective coefficients can all occur and where more samples are available.

[858] A KL-regularization framework for learning to plan with adaptive priors

Álvaro Serra-Gomez, Daniel Jarne Ornia, Dhruva Tirumala, Thomas Moerland

Main category: cs.LG

TL;DR: PO-MPC is a unified framework for MPPI-based reinforcement learning that integrates planner’s action distribution as a prior in policy optimization, improving exploration and performance in high-dimensional continuous control tasks.

DetailsMotivation: To address the challenge of effective exploration in model-based reinforcement learning, particularly in high-dimensional continuous control where sample efficiency is crucial, by better aligning learned policies with planner distributions.

Method: Introduces Policy Optimization-Model Predictive Control (PO-MPC), a family of KL-regularized MBRL methods that use the planner’s action distribution as a prior in policy optimization, allowing flexible trade-offs between return maximization and KL divergence minimization.

Result: The unified framework shows that prior approaches emerge as special cases, and exploring new variations yields significant performance improvements, advancing state of the art in MPPI-based RL.

Conclusion: PO-MPC successfully unifies MPPI-based reinforcement learning methods and demonstrates that better alignment between learned policies and planner distributions leads to improved performance in model-based RL.

Abstract: Effective exploration remains a central challenge in model-based reinforcement learning (MBRL), particularly in high-dimensional continuous control tasks where sample efficiency is crucial. A prominent line of recent work leverages learned policies as proposal distributions for Model-Predictive Path Integral (MPPI) planning. Initial approaches update the sampling policy independently of the planner distribution, typically maximizing a learned value function with deterministic policy gradient and entropy regularization. However, because the states encountered during training depend on the MPPI planner, aligning the sampling policy with the planner improves the accuracy of value estimation and long-term performance. To this end, recent methods update the sampling policy by minimizing KL divergence to the planner distribution or by introducing planner-guided regularization into the policy update. In this work, we unify these MPPI-based reinforcement learning methods under a single framework by introducing Policy Optimization-Model Predictive Control (PO-MPC), a family of KL-regularized MBRL methods that integrate the planner’s action distribution as a prior in policy optimization. By aligning the learned policy with the planner’s behavior, PO-MPC allows more flexibility in the policy updates to trade off Return maximization and KL divergence minimization. We clarify how prior approaches emerge as special cases of this family, and we explore previously unstudied variations. Our experiments show that these extended configurations yield significant performance improvements, advancing the state of the art in MPPI-based RL.

[859] FairAgent: Democratizing Fairness-Aware Machine Learning with LLM-Powered Agents

Yucong Dai, Lu Zhang, Feng Luo, Mashrur Chowdhury, Yongkai Wu

Main category: cs.LG

TL;DR: FairAgent is an LLM-powered automated system that simplifies fairness-aware model development by automatically handling bias analysis, data preprocessing, and bias mitigation without requiring deep technical expertise.

DetailsMotivation: Fair and unbiased machine learning models are crucial for high-stakes applications, but current approaches require deep expertise in fairness definitions, metrics, and techniques, making them inaccessible to many practitioners.

Method: FairAgent uses LLM-powered automation to analyze datasets for biases, handle data preprocessing and feature engineering, and implement appropriate bias mitigation strategies based on user requirements.

Result: Experiments show FairAgent achieves significant performance improvements while substantially reducing development time and expertise requirements.

Conclusion: FairAgent makes fairness-aware machine learning more accessible to practitioners by automating complex fairness development processes.

Abstract: Training fair and unbiased machine learning models is crucial for high-stakes applications, yet it presents significant challenges. Effective bias mitigation requires deep expertise in fairness definitions, metrics, data preprocessing, and machine learning techniques. In addition, the complex process of balancing model performance with fairness requirements while properly handling sensitive attributes makes fairness-aware model development inaccessible to many practitioners. To address these challenges, we introduce FairAgent, an LLM-powered automated system that significantly simplifies fairness-aware model development. FairAgent eliminates the need for deep technical expertise by automatically analyzing datasets for potential biases, handling data preprocessing and feature engineering, and implementing appropriate bias mitigation strategies based on user requirements. Our experiments demonstrate that FairAgent achieves significant performance improvements while significantly reducing development time and expertise requirements, making fairness-aware machine learning more accessible to practitioners.

[860] HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks

Nghiem T. Diep, Dung Le, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho

Main category: cs.LG

TL;DR: HoRA is a parameter-efficient fine-tuning method that improves upon LoRA by using hypernetworks to generate low-rank matrices across attention heads, enabling cross-head information sharing and better performance with minimal parameter increase.

DetailsMotivation: LoRA adapts each attention head separately in multi-head self-attention, missing potential synergies across different heads. This limitation reduces efficiency and performance in fine-tuning large pre-trained models.

Method: Hyper-shared Low-Rank Adaptation (HoRA) uses joint hypernetworks to generate low-rank matrices across attention heads, coupling their adaptation through a shared generator to enable cross-head information sharing.

Result: Theoretical analysis shows HoRA achieves superior sample efficiency compared to LoRA. Extensive experiments across language and vision benchmarks demonstrate HoRA outperforms LoRA and other PEFT methods with only marginal parameter increase.

Conclusion: HoRA effectively addresses LoRA’s limitation by enabling cross-head information sharing through hypernetworks, delivering better performance and sample efficiency while maintaining parameter efficiency.

Abstract: Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique that adapts large pre-trained models by adding low-rank matrices to their weight updates. However, in the context of fine-tuning multi-head self-attention (MHA), LoRA has been employed to adapt each attention head separately, thereby overlooking potential synergies across different heads. To mitigate this issue, we propose a novel Hyper-shared Low-Rank Adaptation (HoRA) method, which utilizes joint hypernetworks to generate low-rank matrices across attention heads. By coupling their adaptation through a shared generator, HoRA encourages cross-head information sharing, and thus directly addresses the aforementioned limitation of LoRA. By comparing LoRA and HoRA through the lens of hierarchical mixture of experts, our theoretical findings reveal that the latter achieves superior sample efficiency to the former. Furthermore, through extensive experiments across diverse language and vision benchmarks, we demonstrate that HoRA outperforms LoRA and other PEFT methods while requiring only a marginal increase in the number of trainable parameters.

[861] Critical appraisal of artificial intelligence for rare-event recognition: principles and pharmacovigilance case studies

G. Niklas Noren, Eva-Lisa Meldau, Johan Ellenius

Main category: cs.LG

TL;DR: This paper provides a framework for evaluating AI models in rare-event recognition scenarios, addressing challenges like limited real-world value despite apparent accuracy, with specific focus on problem framing, statistical evaluation, and practical implementation.

DetailsMotivation: Many AI applications target low-prevalence events where traditional accuracy metrics can be misleading, concealing limited real-world value. The need arises for proper evaluation frameworks in domains where positives are scarce and error costs may be asymmetric.

Method: Proposes structured case-level examination (SCLE) to complement statistical performance evaluation, along with a comprehensive checklist for AI model procurement/development. Framework includes problem framing, test set design, prevalence-aware statistical evaluation, robustness assessment, and human workflow integration.

Result: The framework is instantiated in pharmacovigilance through three studies: rule-based retrieval of pregnancy-related reports, duplicate detection combining ML with probabilistic record linkage, and automated redaction using LLMs. Shows how cost-sensitive targets align model performance with operational value.

Conclusion: While grounded in pharmacovigilance practice, the principles generalize to domains where positives are scarce and error costs may be asymmetric, providing guidance for critical appraisal of AI in rare-event recognition.

Abstract: Many high-stakes AI applications target low-prevalence events, where apparent accuracy can conceal limited real-world value. Relevant AI models range from expert-defined rules and traditional machine learning to generative LLMs constrained for classification. We outline key considerations for critical appraisal of AI in rare-event recognition, including problem framing and test set design, prevalence-aware statistical evaluation, robustness assessment, and integration into human workflows. In addition, we propose an approach to structured case-level examination (SCLE), to complement statistical performance evaluation, and a comprehensive checklist to guide procurement or development of AI models for rare-event recognition. We instantiate the framework in pharmacovigilance, drawing on three studies: rule-based retrieval of pregnancy-related reports; duplicate detection combining machine learning with probabilistic record linkage; and automated redaction of person names using an LLM. We highlight pitfalls specific to the rare-event setting including optimism from unrealistic class balance and lack of difficult positive controls in test sets - and show how cost-sensitive targets align model performance with operational value. While grounded in pharmacovigilance practice, the principles generalize to domains where positives are scarce and error costs may be asymmetric.

[862] Activation Steering with a Feedback Controller

Dung V. Nguyen, Hieu M. Vu, Nhi Y. Pham, Lei Zhang, Tan M. Nguyen

Main category: cs.LG

TL;DR: PID Steering: A control-theoretic framework for LLM activation steering using PID controllers that outperforms existing methods with better stability and robustness.

DetailsMotivation: Existing LLM steering methods lack theoretical guarantees and are primarily empirical. There's a need for principled approaches with performance guarantees.

Method: Proposed PID Steering framework that uses proportional (P), integral (I), and derivative (D) controllers for activation steering. P term aligns activations, I term accumulates errors for persistent corrections, and D term mitigates overshoot.

Result: Extensive experiments across multiple LLM families and benchmarks show PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.

Conclusion: PID Steering provides a principled control-theoretic foundation for LLM activation steering with interpretable error dynamics and classical stability guarantees, while being lightweight and modular.

Abstract: Controlling the behaviors of large language models (LLM) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.

[863] Crash Severity Prediction Using Deep Learning Approaches: A Hybrid CNN-RNN Framework

Sahar Koohfar

Main category: cs.LG

TL;DR: A hybrid CNN-RNN deep learning model outperforms traditional statistical and machine learning methods in predicting traffic crash severity using 15,870 accident records from Virginia highways.

DetailsMotivation: Accurate and timely prediction of crash severity is crucial for intelligent transportation systems to provide appropriate medical assistance and transportation services, helping mitigate severe consequences of traffic accidents.

Method: Implemented a hybrid CNN-RNN deep learning model and compared it against logistic regression, naive bayes, KNN, decision tree, and individual RNN and CNN models. The methodology considers interconnected relationships between traffic accident features using 15,870 accident records from 2015-2021 on Virginia highway I-64.

Result: The proposed CNN-RNN hybrid model outperformed all benchmark models in predicting crash severity, demonstrating superior accuracy compared to traditional statistical and machine learning approaches.

Conclusion: The hybrid CNN-RNN model effectively combines the advantages of both RNN and CNN models to achieve greater accuracy in crash severity prediction, showing its effectiveness for intelligent transportation systems.

Abstract: Accurate and timely prediction of crash severity is crucial in mitigating the severe consequences of traffic accidents. Accurate and timely prediction of crash severity is crucial in mitigating the severe consequences of traffic accidents. In order to provide appropriate levels of medical assistance and transportation services, an intelligent transportation system relies on effective prediction methods. Deep learning models have gained popularity in this domain due to their capability to capture non-linear relationships among variables. In this research, we have implemented a hybrid CNN-RNN deep learning model for crash severity prediction and compared its performance against widely used statistical and machine learning models such as logistic regression, na"ive bayes classifier, K-Nearest Neighbors (KNN), decision tree, and individual deep learning models: RNN and CNN. This study employs a methodology that considers the interconnected relationships between various features of traffic accidents. The study was conducted using a dataset of 15,870 accident records gathered over a period of seven years between 2015 and 2021 on Virginia highway I-64. The findings demonstrate that the proposed CNN-RNN hybrid model has outperformed all benchmark models in terms of predicting crash severity. This result illustrates the effectiveness of the hybrid model as it combines the advantages of both RNN and CNN models in order to achieve greater accuracy in the prediction process.

[864] FoilDiff: A Hybrid Transformer Backbone for Diffusion-based Modelling of 2D Airfoil Flow Fields

Kenechukwu Ogbuagu, Sepehr Maleki, Giuseppe Bruni, Senthil Krishnababu

Main category: cs.LG

TL;DR: FoilDiff is a diffusion-based surrogate model with hybrid CNN-transformer backbone for accurate and efficient airfoil flow field prediction, reducing errors by up to 85% compared to state-of-the-art models.

DetailsMotivation: CFD models are computationally expensive for aerodynamic design optimization, creating need for faster surrogate models. Diffusion models show promise for complex flow field prediction but need improved accuracy and efficiency.

Method: Proposed FoilDiff with hybrid CNN-transformer denoising network combining convolutional feature extraction and transformer global attention. Uses DDIM sampling for efficiency and encoded Reynolds number, angle of attack, and airfoil geometry inputs for generalization.

Result: Significant performance improvements with mean prediction errors reduced by up to 85% on same datasets. Provides more accurate predictions and better-calibrated predictive uncertainty than existing diffusion-based models.

Conclusion: FoilDiff demonstrates superior accuracy and uncertainty calibration for airfoil flow field prediction, offering an efficient alternative to expensive CFD simulations for aerodynamic optimization.

Abstract: The accurate prediction of flow fields around airfoils is crucial for aerodynamic design and optimisation. Computational Fluid Dynamics (CFD) models are effective but computationally expensive, thus inspiring the development of surrogate models to enable quicker predictions. These surrogate models can be based on deep learning architectures, such as Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), and Diffusion Models (DMs). Diffusion models have shown significant promise in predicting complex flow fields. In this work, we propose FoilDiff, a diffusion-based surrogate model with a hybrid-backbone denoising network. This hybrid design combines the power of convolutional feature extraction and transformer-based global attention to generate more adaptable and accurate representations of flow structures. FoilDiff takes advantage of Denoising Diffusion Implicit Model (DDIM) sampling to optimise the efficiency of the sampling process at no additional cost to model generalisation. We used encoded representations of Reynolds number, angle of attack, and airfoil geometry to define the input space for generalisation across a wide range of aerodynamic conditions. When evaluated against state-of-the-art models, FoilDiff shows significant performance improvements, with mean prediction errors reducing by up to 85% on the same datasets. The results have demonstrated that FoilDiff can provide both more accurate predictions and better-calibrated predictive uncertainty than existing diffusion-based models.

[865] GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek

Main category: cs.LG

TL;DR: GDPval is a benchmark evaluating AI models on real-world economic tasks across 44 occupations in top U.S. GDP sectors, showing frontier models are approaching expert quality and can perform tasks cheaper/faster with human oversight.

DetailsMotivation: To create a comprehensive benchmark that evaluates AI model capabilities on economically valuable real-world tasks, covering the majority of U.S. Bureau of Labor Statistics Work Activities across key GDP sectors.

Method: Constructed tasks from representative work of industry professionals with average 14 years experience, covering 44 occupations across top 9 U.S. GDP sectors. Used automated grading service and open-sourced 220 gold tasks.

Result: Frontier model performance is improving linearly over time, with current best models approaching industry expert quality. Models with human oversight can perform tasks cheaper and faster than unaided experts. Increased reasoning effort, task context, and scaffolding improves performance.

Conclusion: GDPval provides a valuable benchmark for understanding real-world AI capabilities, showing models are becoming increasingly capable of performing economically valuable tasks, with potential for cost-effective human-AI collaboration.

Abstract: We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improves model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.

[866] Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets

Haosong Zhang, Shenxi Wu, Yichi Zhang, Wei Lin

Main category: cs.LG

TL;DR: AM-μP introduces a network-wide average pre-activation constraint and residual-aware initialization to solve learning rate scaling issues in deep heterogeneous architectures, establishing a universal L^{-3/2} scaling law for convolutional and residual networks.

DetailsMotivation: Classical μP parameterization fails in heterogeneous architectures with residual connections and convolutions due to layer imbalance, requiring a more robust approach for learning rate scaling across depths.

Method: Arithmetic-Mean μP (AM-μP) constrains the network-wide average pre-activation second moment to constant scale, combined with residual-aware He initialization that scales weights by number of blocks.

Result: Proves η*(L) ∝ L^{-3/2} for convolutional networks and Θ(L^{-3/2}) for residual networks, with empirical validation across depths enabling zero-shot learning rate transfer.

Conclusion: AM-μP provides a unified and practical learning rate principle for convolutional and deep residual networks without additional tuning overhead.

Abstract: Choosing an appropriate learning rate remains a key challenge in scaling depth of modern deep networks. The classical maximal update parameterization ($\mu$P) enforces a fixed per-layer update magnitude, which is well suited to homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in heterogeneous architectures where residual accumulation and convolutions introduce imbalance across layers. We introduce Arithmetic-Mean $\mu$P (AM-$\mu$P), which constrains not each individual layer but the network-wide average one-step pre-activation second moment to a constant scale. Combined with a residual-aware He fan-in initialization - scaling residual-branch weights by the number of blocks ($\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$) - AM-$\mu$P yields width-robust depth laws that transfer consistently across depths. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects are constant-level as $N\gg k$. For standard residual networks with general conv+MLP blocks, we establish $\eta^\star(L)=\Theta(L^{-3/2})$, with $L$ the minimal depth. Empirical results across a range of depths confirm the $-3/2$ scaling law and enable zero-shot learning-rate transfer, providing a unified and practical LR principle for convolutional and deep residual networks without additional tuning overhead.

[867] Adaptive Weighted Loss for Sequential Recommendations on Sparse Domains

Akshay Mittal, Vinay Venkatesh, Krishna Kandi, Shalini Sudarshan

Main category: cs.LG

TL;DR: Proposes a dynamic weighted loss function for sequential recommendation that adaptively adjusts weights based on domain sparsity, outperforming static weighting methods especially for sparse domains.

DetailsMotivation: Static weighted loss in PinnerFormerLite is suboptimal for sparse domains where training signals get diluted by generic data. Need adaptive weighting to handle varying domain densities.

Method: Dynamic weighted loss with adaptive algorithm that assigns higher weights to sparser domains and lower weights to denser domains, supported by theoretical analysis including convergence proofs and complexity analysis.

Result: Significantly outperforms state-of-the-art baselines (SIGMA, CALRec, SparseEnNet) across four datasets, achieving substantial lifts in Recall@10 and NDCG@10 for sparse domains while maintaining performance on dense domains with minimal computational overhead.

Conclusion: The dynamic weighting approach effectively handles domain sparsity variations, ensuring meaningful gradient signals for rare user interests without compromising performance on popular domains.

Abstract: The effectiveness of single-model sequential recommendation architectures, while scalable, is often limited when catering to “power users” in sparse or niche domains. Our previous research, PinnerFormerLite, addressed this by using a fixed weighted loss to prioritize specific domains. However, this approach can be sub-optimal, as a single, uniform weight may not be sufficient for domains with very few interactions, where the training signal is easily diluted by the vast, generic dataset. This paper proposes a novel, data-driven approach: a Dynamic Weighted Loss function with comprehensive theoretical foundations and extensive empirical validation. We introduce an adaptive algorithm that adjusts the loss weight for each domain based on its sparsity in the training data, assigning a higher weight to sparser domains and a lower weight to denser ones. This ensures that even rare user interests contribute a meaningful gradient signal, preventing them from being overshadowed. We provide rigorous theoretical analysis including convergence proofs, complexity analysis, and bounds analysis to establish the stability and efficiency of our approach. Our comprehensive empirical validation across four diverse datasets (MovieLens, Amazon Electronics, Yelp Business, LastFM Music) with state-of-the-art baselines (SIGMA, CALRec, SparseEnNet) demonstrates that this dynamic weighting system significantly outperforms all comparison methods, particularly for sparse domains, achieving substantial lifts in key metrics like Recall at 10 and NDCG at 10 while maintaining performance on denser domains and introducing minimal computational overhead.

[868] DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks

Nghiem T. Diep, Hien Dang, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho

Main category: cs.LG

TL;DR: DoRAN improves DoRA by adding noise-based regularization and dynamic parameter generation for more stable and sample-efficient fine-tuning.

DetailsMotivation: To address training instability and improve sample efficiency in parameter-efficient fine-tuning methods, particularly building on DoRA's limitations.

Method: Two-stage approach: (1) inject noise into DoRA’s denominator for adaptive regularization, (2) replace static low-rank matrices with auxiliary networks for dynamic parameter generation with cross-layer coupling.

Result: Consistently outperforms LoRA, DoRA, and other PEFT baselines on vision and language benchmarks.

Conclusion: Combining noise-based stabilization with network-based parameter generation provides robust and efficient fine-tuning for foundation models.

Abstract: Parameter-efficient fine-tuning (PEFT) methods have become the standard paradigm for adapting large-scale models. Among these techniques, Weight-Decomposed Low-Rank Adaptation (DoRA) has been shown to improve both the learning capacity and training stability of the vanilla Low-Rank Adaptation (LoRA) method by explicitly decomposing pre-trained weights into magnitude and directional components. In this work, we propose DoRAN, a new variant of DoRA designed to further stabilize training and boost the sample efficiency of DoRA. Our approach includes two key stages: (i) injecting noise into the denominator of DoRA’s weight decomposition, which serves as an adaptive regularizer to mitigate instabilities; and (ii) replacing static low-rank matrices with auxiliary networks that generate them dynamically, enabling parameter coupling across layers and yielding better sample efficiency in both theory and practice. Comprehensive experiments on vision and language benchmarks show that DoRAN consistently outperforms LoRA, DoRA, and other PEFT baselines. These results underscore the effectiveness of combining stabilization through noise-based regularization with network-based parameter generation, offering a promising direction for robust and efficient fine-tuning of foundation models.

[869] Learning to Predict Chaos: Curriculum-Driven Training for Robust Forecasting of Chaotic Dynamics

Harshil Vejendla

Main category: cs.LG

TL;DR: CCF is a curriculum learning approach for chaotic system forecasting that organizes training from simple periodic to complex chaotic systems using Lyapunov exponents and attractor dimensions, improving prediction horizons by up to 40% on real-world benchmarks.

DetailsMotivation: Existing ML approaches either over-specialize on single chaotic systems or mix unrelated time-series, failing to learn nuanced dynamical regimes. There's a need for training methods that enable robust and generalizable representations of chaotic behaviors.

Method: Curriculum Chaos Forecasting (CCF) organizes training data based on dynamical systems theory, progressing from simple periodic behaviors to complex chaotic dynamics. Uses largest Lyapunov exponent and attractor dimension to quantify complexity. Trains on over 50 synthetic ODE/PDE systems in curriculum order.

Result: CCF extends valid prediction horizon by up to 40% compared to random-order training and more than doubles it compared to training on real-world data alone. Benefits are consistent across datasets (Sunspot numbers, electricity demand, ECG signals) and neural architectures (GRU, Transformer).

Conclusion: Curriculum-based training organized by dynamical complexity significantly improves chaotic system forecasting performance and generalizability, demonstrating the importance of structured learning progression in capturing complex temporal dynamics.

Abstract: Forecasting chaotic systems is a cornerstone challenge in many scientific fields, complicated by the exponential amplification of even infinitesimal prediction errors. Modern machine learning approaches often falter due to two opposing pitfalls: over-specializing on a single, well-known chaotic system (e.g., Lorenz-63), which limits generalizability, or indiscriminately mixing vast, unrelated time-series, which prevents the model from learning the nuances of any specific dynamical regime. We propose Curriculum Chaos Forecasting (CCF), a training paradigm that bridges this gap. CCF organizes training data based on fundamental principles of dynamical systems theory, creating a curriculum that progresses from simple, periodic behaviors to highly complex, chaotic dynamics. We quantify complexity using the largest Lyapunov exponent and attractor dimension, two well-established metrics of chaos. By first training a sequence model on predictable systems and gradually introducing more chaotic trajectories, CCF enables the model to build a robust and generalizable representation of dynamical behaviors. We curate a library of over 50 synthetic ODE/PDE systems to build this curriculum. Our experiments show that pre-training with CCF significantly enhances performance on unseen, real-world benchmarks. On datasets including Sunspot numbers, electricity demand, and human ECG signals, CCF extends the valid prediction horizon by up to 40% compared to random-order training and more than doubles it compared to training on real-world data alone. We demonstrate that this benefit is consistent across various neural architectures (GRU, Transformer) and provide extensive ablations to validate the importance of the curriculum’s structure.

[870] Real-time Prediction of Urban Sound Propagation with Conditioned Normalizing Flows

Achim Eckerle, Martin Spitznagel, Janis Keuper

Main category: cs.LG

TL;DR: The paper presents a conditional Normalizing Flows model (Full-Glow) that generates urban sound-pressure maps from 2D urban layouts in real time, achieving >2000x speedup over physics-based solvers while improving accuracy.

DetailsMotivation: Urban noise prediction is crucial for public health and regulatory workflows, but traditional physics-based solvers are too slow for time-critical iterative studies required by the Environmental Noise Directive.

Method: Uses conditional Normalizing Flows (Full-Glow) to generate standards-compliant urban sound-pressure maps from 2D urban layouts, enabling real-time computation on commodity hardware (RTX 4090).

Result: Achieves >2000x speedup over reference solver, improves NLoS accuracy by up to 24% versus prior deep models, reaches 0.65 dB MAE in Baseline NLoS with high structural fidelity, and reproduces diffraction and interference patterns.

Conclusion: The model enables interactive exploration and instant recomputation under source or geometry changes, making it practical for urban planning, compliance mapping, and operational assessments.

Abstract: Accurate and fast urban noise prediction is pivotal for public health and for regulatory workflows in cities, where the Environmental Noise Directive mandates regular strategic noise maps and action plans, often needed in permission workflows, right-of-way allocation, and construction scheduling. Physics-based solvers are too slow for such time-critical, iterative “what-if” studies. We evaluate conditional Normalizing Flows (Full-Glow) for generating for generating standards-compliant urban sound-pressure maps from 2D urban layouts in real time per 256x256 map on a single RTX 4090), enabling interactive exploration directly on commodity hardware. On datasets covering Baseline, Diffraction, and Reflection regimes, our model accelerates map generation by >2000 times over a reference solver while improving NLoS accuracy by up to 24% versus prior deep models; in Baseline NLoS we reach 0.65 dB MAE with high structural fidelity. The model reproduces diffraction and interference patterns and supports instant recomputation under source or geometry changes, making it a practical engine for urban planning, compliance mapping, and operations (e.g., temporary road closures, night-work variance assessments).

[871] From News to Returns: A Granger-Causal Hypergraph Transformer on the Sphere

Anoushka Harit, Zhongtian Sun, Jongmin Yu

Main category: cs.LG

TL;DR: CSHT is a novel financial forecasting model that combines Granger-causal hypergraphs, Riemannian geometry, and causally masked Transformer attention to provide interpretable predictions.

DetailsMotivation: To create an interpretable financial forecasting model that can handle directional influences from news/sentiment on asset returns while maintaining geometric consistency and causal structure.

Method: Models multivariate Granger-causal dependencies as directional hyperedges on a hypersphere, using angular masks to constrain Transformer attention for temporal directionality and geometric consistency.

Result: Outperforms baselines on S&P 500 data (2018-2023) across return prediction, regime classification, and top-asset ranking tasks, especially during the 2020 COVID-19 shock.

Conclusion: CSHT provides both robust generalization across market regimes and transparent attribution pathways, making it a principled solution for trustworthy financial forecasting under uncertainty.

Abstract: We propose the Causal Sphere Hypergraph Transformer (CSHT), a novel architecture for interpretable financial time-series forecasting that unifies \emph{Granger-causal hypergraph structure}, \emph{Riemannian geometry}, and \emph{causally masked Transformer attention}. CSHT models the directional influence of financial news and sentiment on asset returns by extracting multivariate Granger-causal dependencies, which are encoded as directional hyperedges on the surface of a hypersphere. Attention is constrained via angular masks that preserve both temporal directionality and geometric consistency. Evaluated on S&P 500 data from 2018 to 2023, including the 2020 COVID-19 shock, CSHT consistently outperforms baselines across return prediction, regime classification, and top-asset ranking tasks. By enforcing predictive causal structure and embedding variables in a Riemannian manifold, CSHT delivers both \emph{robust generalisation across market regimes} and \emph{transparent attribution pathways} from macroeconomic events to stock-level responses. These results suggest that CSHT is a principled and practical solution for trustworthy financial forecasting under uncertainty.

[872] Quantifying Ambiguity in Categorical Annotations: A Measure and Statistical Inference Framework

Christopher Klugmann, Daniel Kondermann

Main category: cs.LG

TL;DR: The paper introduces a new ambiguity measure for categorical annotations that distinguishes between class-level indistinguishability and explicit unresolvability, with statistical inference tools for population ambiguity estimation.

DetailsMotivation: Human categorical annotations often reflect genuine ambiguity rather than simple errors, requiring better ways to quantify aleatoric uncertainty in categorical tasks.

Method: Developed an ambiguity measure that treats ‘can’t solve’ category asymmetrically, analyzed its formal properties, and created frequentist point estimators and Bayesian posterior inference using Dirichlet priors.

Result: The measure effectively separates different types of uncertainty and provides principled statistical tools for ambiguity estimation, calibration, and dataset-quality assessment.

Conclusion: The proposed ambiguity measure and inference framework offer practical tools for understanding and quantifying uncertainty in categorical annotation tasks, with applications in dataset assessment and machine learning workflows.

Abstract: Human-generated categorical annotations frequently produce empirical response distributions (soft labels) that reflect ambiguity rather than simple annotator error. We introduce an ambiguity measure that maps a discrete response distribution to a scalar in the unit interval, designed to quantify aleatoric uncertainty in categorical tasks. The measure bears a close relationship to quadratic entropy (Gini-style impurity) but departs from those indices by treating an explicit “can’t solve” category asymmetrically, thereby separating uncertainty arising from class-level indistinguishability from uncertainty due to explicit unresolvability. We analyze the measure’s formal properties and contrast its behavior with a representative ambiguity measure from the literature. Moving beyond description, we develop statistical tools for inference: we propose frequentist point estimators for population ambiguity and derive the Bayesian posterior over ambiguity induced by Dirichlet priors on the underlying probability vector, providing a principled account of epistemic uncertainty. Numerical examples illustrate estimation, calibration, and practical use for dataset-quality assessment and downstream machine-learning workflows.

[873] Categorical Invariants of Learning Dynamics

Abdulrahman Tamim

Main category: cs.LG

TL;DR: Learning is a structure-preserving transformation between parameter and representation spaces, where training runs in the same homotopy class generalize similarly.

DetailsMotivation: To provide a fundamentally different perspective on neural network training beyond gradient descent, revealing categorical invariants that explain generalization and optimization behavior.

Method: Categorical framework treating learning as a functor between parameter and representation spaces, using homotopy theory, persistent homology, and 2-categorical structures to analyze optimization paths.

Result: Networks with homotopic training trajectories generalize within 0.5% accuracy, while non-homotopic paths differ by over 3%. Persistent homology predicts generalization with R^2 = 0.82 correlation.

Conclusion: Categorical invariants offer theoretical insights into why deep learning works and provide practical tools for training more robust networks through structure-preserving transformations.

Abstract: Neural network training is typically viewed as gradient descent on a loss surface. We propose a fundamentally different perspective: learning is a structure-preserving transformation (a functor L) between the space of network parameters (Param) and the space of learned representations (Rep). This categorical framework reveals that different training runs producing similar test performance often belong to the same homotopy class (continuous deformation family) of optimization paths. We show experimentally that networks converging via homotopic trajectories generalize within 0.5% accuracy of each other, while non-homotopic paths differ by over 3%. The theory provides practical tools: persistent homology identifies stable minima predictive of generalization (R^2 = 0.82 correlation), pullback constructions formalize transfer learning, and 2-categorical structures explain when different optimization algorithms yield functionally equivalent models. These categorical invariants offer both theoretical insight into why deep learning works and concrete algorithmic principles for training more robust networks.

[874] Post-training quantization of vision encoders needs prefixing registers

Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee

Main category: cs.LG

TL;DR: Proposes RegCache, a training-free method to mitigate outliers in vision encoders for better quantization performance, using prefix tokens and token deletion techniques.

DetailsMotivation: Transformer-based vision encoders like CLIP are crucial for multimodal applications but face challenges in quantization due to massive-scale activations and outliers, especially at 8-bit precision.

Method: Introduces RegCache which adds outlier-prone but semantically meaningless prefix tokens to vision encoders to prevent other tokens from having outliers, using middle-layer prefixing and token deletion techniques.

Result: Consistently improves accuracy of quantized models across both text-supervised and self-supervised vision encoders.

Conclusion: RegCache effectively enables quantization of vision encoders with significantly smaller accuracy drops by addressing outlier behavior specific to vision models.

Abstract: Transformer-based vision encoders – such as CLIP – are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.

[875] Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction

Yisen Gao, Xingcheng Fu, Qingyun Sun, Jianxin Li, Xianxian Li

Main category: cs.LG

TL;DR: GeoMancer is a Riemannian graph diffusion framework that addresses numerical instability and manifold deviation in graph diffusion models by using isometric-invariant Riemannian gyrokernels and manifold-constrained diffusion methods.

DetailsMotivation: Existing graph diffusion models embed features of different curvatures into a unified latent space without leveraging their geometric potential, leading to suboptimal performance due to the non-Euclidean nature of graph data.

Method: Proposes GeoMancer with two key innovations: 1) replaces exponential mapping with isometric-invariant Riemannian gyrokernel approach and decouples multi-level features onto task-specific manifolds, 2) introduces manifold-constrained diffusion method and self-guided strategy for unconditional generation.

Result: Extensive experiments demonstrate superior performance across various tasks compared to existing approaches.

Conclusion: GeoMancer effectively captures distinct manifold signatures of complex graph data and learns their distribution while maintaining numerical stability and manifold alignment during generation.

Abstract: Graph diffusion models have made significant progress in learning structured graph data and have demonstrated strong potential for predictive tasks. Existing approaches typically embed node, edge, and graph-level features into a unified latent space, modeling prediction tasks including classification and regression as a form of conditional generation. However, due to the non-Euclidean nature of graph data, features of different curvatures are entangled in the same latent space without releasing their geometric potential. To address this issue, we aim to construt an ideal Riemannian diffusion model to capture distinct manifold signatures of complex graph data and learn their distribution. This goal faces two challenges: numerical instability caused by exponential mapping during the encoding proces and manifold deviation during diffusion generation. To address these challenges, we propose GeoMancer: a novel Riemannian graph diffusion framework for both generation and prediction tasks. To mitigate numerical instability, we replace exponential mapping with an isometric-invariant Riemannian gyrokernel approach and decouple multi-level features onto their respective task-specific manifolds to learn optimal representations. To address manifold deviation, we introduce a manifold-constrained diffusion method and a self-guided strategy for unconditional generation, ensuring that the generated data remains aligned with the manifold signature. Extensive experiments validate the effectiveness of our approach, demonstrating superior performance across a variety of tasks.

[876] Score-based Greedy Search for Structure Identification of Partially Observed Linear Causal Models

Xinshuai Dong, Ignavier Ng, Haoyue Dai, Jiaqi Sun, Xiangchen Song, Peter Spirtes, Kun Zhang

Main category: cs.LG

TL;DR: Proposes the first score-based greedy search method for causal discovery with latent variables, called LGES, with identifiability guarantees.

DetailsMotivation: Constraint-based causal discovery methods face challenges with multiple testing and error propagation. Score-based methods could mitigate these issues but none existed for partially observed scenarios with latent variables.

Method: Proposed Generalized N Factor Model and designed Latent variable Greedy Equivalence Search (LGES) algorithm with well-defined operators for efficient graph space search.

Result: Established global consistency: true structure including latent variables can be identified up to Markov equivalence class using score. Experiments on synthetic and real-life data validate effectiveness.

Conclusion: LGES is an effective score-based greedy search method for causal discovery with latent variables, with identifiability guarantees and efficient performance.

Abstract: Identifying the structure of a partially observed causal system is essential to various scientific fields. Recent advances have focused on constraint-based causal discovery to solve this problem, and yet in practice these methods often face challenges related to multiple testing and error propagation. These issues could be mitigated by a score-based method and thus it has raised great attention whether there exists a score-based greedy search method that can handle the partially observed scenario. In this work, we propose the first score-based greedy search method for the identification of structure involving latent variables with identifiability guarantees. Specifically, we propose Generalized N Factor Model and establish the global consistency: the true structure including latent variables can be identified up to the Markov equivalence class by using score. We then design Latent variable Greedy Equivalence Search (LGES), a greedy search algorithm for this class of model with well-defined operators, which search very efficiently over the graph space to find the optimal structure. Our experiments on both synthetic and real-life data validate the effectiveness of our method (code will be publicly available).

[877] SSM-CGM: Interpretable State-Space Forecasting Model of Continuous Glucose Monitoring for Personalized Diabetes Management

Shakson Isaac, Yentl Collin, Chirag Patel

Main category: cs.LG

TL;DR: SSM-CGM is a Mamba-based neural state-space model that improves glucose forecasting accuracy and interpretability by integrating CGM and wearable data, enabling counterfactual analysis for diabetes management.

DetailsMotivation: Current CGM forecasting models lack interpretability needed for clinical use, limiting their practical application in diabetes management despite generating dense data streams.

Method: Developed SSM-CGM using Mamba-based neural state-space model that integrates continuous glucose monitoring (CGM) and wearable activity signals from the AI-READI cohort, incorporating variable selection and temporal attribution for interpretability.

Result: SSM-CGM improves short-term accuracy over Temporal Fusion Transformer baseline, provides interpretability through variable selection and temporal attribution, and enables counterfactual forecasts simulating effects of physiological signal changes.

Conclusion: SSM-CGM provides an interpretable, physiologically grounded framework for personalized diabetes management by combining improved forecasting accuracy with clinical interpretability features.

Abstract: Continuous glucose monitoring (CGM) generates dense data streams critical for diabetes management, but most used forecasting models lack interpretability for clinical use. We present SSM-CGM, a Mamba-based neural state-space forecasting model that integrates CGM and wearable activity signals from the AI-READI cohort. SSM-CGM improves short-term accuracy over a Temporal Fusion Transformer baseline, adds interpretability through variable selection and temporal attribution, and enables counterfactual forecasts simulating how planned changes in physiological signals (e.g., heart rate, respiration) affect near-term glucose. Together, these features make SSM-CGM an interpretable, physiologically grounded framework for personalized diabetes management.

[878] SONA: Learning Conditional, Unconditional, and Mismatching-Aware Discriminator

Yuhta Takida, Satoshi Hayakawa, Takashi Shibuya, Masaaki Imaizumi, Naoki Murata, Bac Nguyen, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuki Mitsufuji

Main category: cs.LG

TL;DR: Proposes SONA, a novel discriminator design for conditional GANs that integrates unconditional discrimination, matching-aware supervision, and adaptive weighting to improve both sample authenticity and conditional alignment.

DetailsMotivation: Existing conditional GANs struggle to balance authenticity assessment and conditional alignment within their discriminators, limiting their performance in conditional generation tasks.

Method: Introduces SONA discriminator with separate projections for naturalness (authenticity) and alignment, using inductive bias, dedicated objective functions, and adaptive weighting mechanism to dynamically balance all objectives.

Result: Extensive experiments on class-conditional generation tasks show superior sample quality and conditional alignment compared to state-of-the-art methods. Also effective in text-to-image generation.

Conclusion: SONA provides a versatile and robust approach for conditional generation that effectively balances authenticity and alignment objectives through its novel discriminator design.

Abstract: Deep generative models have made significant advances in generating complex content, yet conditional generation remains a fundamental challenge. Existing conditional generative adversarial networks often struggle to balance the dual objectives of assessing authenticity and conditional alignment of input samples within their conditional discriminators. To address this, we propose a novel discriminator design that integrates three key capabilities: unconditional discrimination, matching-aware supervision to enhance alignment sensitivity, and adaptive weighting to dynamically balance all objectives. Specifically, we introduce Sum of Naturalness and Alignment (SONA), which employs separate projections for naturalness (authenticity) and alignment in the final layer with an inductive bias, supported by dedicated objective functions and an adaptive weighting mechanism. Extensive experiments on class-conditional generation tasks show that \ours achieves superior sample quality and conditional alignment compared to state-of-the-art methods. Furthermore, we demonstrate its effectiveness in text-to-image generation, confirming the versatility and robustness of our approach.

[879] GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning

Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang

Main category: cs.LG

TL;DR: GILT is a Graph In-context Learning Transformer that addresses graph heterogeneity through a token-based framework for unified classification tasks without requiring LLMs or fine-tuning.

DetailsMotivation: Current Graph Foundational Models struggle with graph heterogeneity (unique features, labels, topologies) and face limitations: LLM-based approaches can't handle numerical features well, while structure-based models require costly per-graph tuning.

Method: GILT introduces a token-based framework for in-context learning on graphs, reframing node, edge, and graph classification tasks in a unified approach that operates on generic numerical features and dynamically understands class semantics from context.

Result: Comprehensive experiments show GILT achieves stronger few-shot performance with significantly less time than LLM-based or tuning-based baselines.

Conclusion: GILT provides an effective LLM-free and tuning-free solution for handling graph heterogeneity while maintaining strong performance and efficiency.

Abstract: Graph Neural Networks (GNNs) are powerful tools for precessing relational data but often struggle to generalize to unseen graphs, giving rise to the development of Graph Foundational Models (GFMs). However, current GFMs are challenged by the extreme heterogeneity of graph data, where each graph can possess a unique feature space, label set, and topology. To address this, two main paradigms have emerged. The first leverages Large Language Models (LLMs), but is fundamentally text-dependent, thus struggles to handle the numerical features in vast graphs. The second pre-trains a structure-based model, but the adaptation to new tasks typically requires a costly, per-graph tuning stage, creating a critical efficiency bottleneck. In this work, we move beyond these limitations and introduce \textbf{G}raph \textbf{I}n-context \textbf{L}earning \textbf{T}ransformer (GILT), a framework built on an LLM-free and tuning-free architecture. GILT introduces a novel token-based framework for in-context learning (ICL) on graphs, reframing classification tasks spanning node, edge and graph levels in a unified framework. This mechanism is the key to handling heterogeneity, as it is designed to operate on generic numerical features. Further, its ability to understand class semantics dynamically from the context enables tuning-free adaptation. Comprehensive experiments show that GILT achieves stronger few-shot performance with significantly less time than LLM-based or tuning-based baselines, validating the effectiveness of our approach.

[880] Achieve Performatively Optimal Policy for Performative Reinforcement Learning

Ziyi Chen, Heng Huang

Main category: cs.LG

TL;DR: This paper proposes a zeroth-order Frank-Wolfe algorithm (0-FW) for performative reinforcement learning that achieves polynomial-time convergence to the performatively optimal (PO) policy, overcoming the constant gap limitation of existing methods that only find performatively stable (PS) policies.

DetailsMotivation: Existing performative RL methods only find performatively stable policies that have a provable constant gap from the desired performatively optimal policy. This work aims to bridge this gap and achieve true optimality.

Method: The authors propose a zeroth-order Frank-Wolfe algorithm (0-FW) that uses zeroth-order approximation of the performative policy gradient within the Frank-Wolfe framework, combined with analysis showing the value function’s gradient dominance and boundedness properties.

Result: The algorithm achieves the first polynomial-time convergence to the desired performatively optimal policy under standard regularizer dominance conditions, outperforming existing methods.

Conclusion: The 0-FW algorithm effectively finds the performatively optimal policy by leveraging key properties of the value function and zeroth-order gradient approximation, providing both theoretical guarantees and empirical effectiveness.

Abstract: Performative reinforcement learning is an emerging dynamical decision making framework, which extends reinforcement learning to the common applications where the agent’s policy can change the environmental dynamics. Existing works on performative reinforcement learning only aim at a performatively stable (PS) policy that maximizes an approximate value function. However, there is a provably positive constant gap between the PS policy and the desired performatively optimal (PO) policy that maximizes the original value function. In contrast, this work proposes a zeroth-order Frank-Wolfe algorithm (0-FW) algorithm with a zeroth-order approximation of the performative policy gradient in the Frank-Wolfe framework, and obtains \textbf{the first polynomial-time convergence to the desired PO} policy under the standard regularizer dominance condition. For the convergence analysis, we prove two important properties of the nonconvex value function. First, when the policy regularizer dominates the environmental shift, the value function satisfies a certain gradient dominance property, so that any stationary point (not PS) of the value function is a desired PO. Second, though the value function has unbounded gradient, we prove that all the sufficiently stationary points lie in a convex and compact policy subspace $\Pi_{\Delta}$, where the policy value has a constant lower bound $\Delta>0$ and thus the gradient becomes bounded and Lipschitz continuous. Experimental results also demonstrate that our 0-FW algorithm is more effective than the existing algorithms in finding the desired PO policy.

[881] Trade-off in Estimating the Number of Byzantine Clients in Federated Learning

Ziyi Chen, Su Zhang, Heng Huang

Main category: cs.LG

TL;DR: This paper analyzes the impact of estimating Byzantine client count in federated learning, showing that underestimation leads to poor performance while overestimation causes performance degradation proportional to the overestimation.

DetailsMotivation: Federated learning is vulnerable to Byzantine clients, and robust aggregators require estimating the number of such clients. The effect of this estimation on performance hasn't been systematically studied.

Method: Theoretical analysis of worst-case error bounds for aggregators and federated learning algorithms under different cases of estimated vs actual Byzantine client counts.

Result: Underestimation (estimated count < actual count) leads to arbitrarily poor performance. For non-underestimation, optimal error bounds proportional to estimated_count/(total_clients - actual_count - estimated_count) are established.

Conclusion: There’s a fundamental trade-off: larger robustness degree allows handling more Byzantine clients but degrades performance when fewer are actually present, with error bounds increasing monotonically with overestimation.

Abstract: Federated learning has attracted increasing attention at recent large-scale optimization and machine learning research and applications, but is also vulnerable to Byzantine clients that can send any erroneous signals. Robust aggregators are commonly used to resist Byzantine clients. This usually requires to estimate the unknown number $f$ of Byzantine clients, and thus accordingly select the aggregators with proper degree of robustness (i.e., the maximum number $\hat{f}$ of Byzantine clients allowed by the aggregator). Such an estimation should have important effect on the performance, which has not been systematically studied to our knowledge. This work will fill in the gap by theoretically analyzing the worst-case error of aggregators as well as its induced federated learning algorithm for any cases of $\hat{f}$ and $f$. Specifically, we will show that underestimation ($\hat{f}<f$) can lead to arbitrarily poor performance for both aggregators and federated learning. For non-underestimation ($\hat{f}\ge f$), we have proved optimal lower and upper bounds of the same order on the errors of both aggregators and federated learning. All these optimal bounds are proportional to $\hat{f}/(n-f-\hat{f})$ with $n$ clients, which monotonically increases with larger $\hat{f}$. This indicates a fundamental trade-off: while an aggregator with a larger robustness degree $\hat{f}$ can solve federated learning problems of wider range $f\in [0,\hat{f}]$, the performance can deteriorate when there are actually fewer or even no Byzantine clients (i.e., $f\in [0,\hat{f})$).

[882] Fractional Heat Kernel for Semi-Supervised Graph Learning with Small Training Sample Size

Farid Bozorgnia, Vyacheslav Kungurtsev, Shirali Kadyrov, Mohsen Yousefnezhad

Main category: cs.LG

TL;DR: Novel algorithms for label propagation and self-training using fractional heat kernel dynamics with source term, integrated into Graph Neural Networks for enhanced expressiveness through adaptive multi-hop diffusion.

DetailsMotivation: Leverage the correspondence between information theory and physics of parabolic evolution equations to improve graph learning, particularly when only a small number of labeled training examples are available.

Method: Integrate fractional heat kernel into Graph Neural Network architectures (GCNs and Graph Attention) using Chebyshev polynomial approximations for computational feasibility on large graphs, with variational formulations for nonlocal interactions.

Result: The approach demonstrates effectiveness on standard datasets, showing improved performance through more globally diffusing labels via fractional Laplacian powers.

Conclusion: Fractional heat kernel dynamics provide a powerful framework for label propagation and self-training in graph neural networks, especially beneficial in low-label scenarios through enhanced nonlocal diffusion.

Abstract: In this work, we introduce novel algorithms for label propagation and self-training using fractional heat kernel dynamics with a source term. We motivate the methodology through the classical correspondence of information theory with the physics of parabolic evolution equations. We integrate the fractional heat kernel into Graph Neural Network architectures such as Graph Convolutional Networks and Graph Attention, enhancing their expressiveness through adaptive, multi-hop diffusion. By applying Chebyshev polynomial approximations, large graphs become computationally feasible. Motivating variational formulations demonstrate that by extending the classical diffusion model to fractional powers of the Laplacian, nonlocal interactions deliver more globally diffusing labels. The particular balance between supervision of known labels and diffusion across the graph is particularly advantageous in the case where only a small number of labeled training examples are present. We demonstrate the effectiveness of this approach on standard datasets.

[883] Domain Generalization: A Tale of Two ERMs

Yilun Zhu, Naihao Deng, Naichen Shi, Aditya Gangrade, Clayton Scott

Main category: cs.LG

TL;DR: Domain generalization performance depends on dataset characteristics - domain-informed ERM (augmenting features with domain info) outperforms standard ERM under posterior drift, while standard ERM works better under covariate shift.

DetailsMotivation: Previous DG literature found it difficult to outperform empirical risk minimization (ERM) on pooled training data, but this was primarily reported for datasets with covariate shift. The authors investigate whether different dataset assumptions affect DG performance.

Method: Proposed ‘domain-informed ERM’ where feature vectors are augmented with domain-specific information. Used theoretical framework and experiments on language and vision tasks to validate the approach.

Result: Domain-informed ERM outperforms pooling ERM when datasets satisfy posterior drift assumption, while standard ERM performs better under covariate shift assumption.

Conclusion: The effectiveness of domain generalization methods depends on the underlying dataset characteristics - posterior drift vs covariate shift assumptions determine whether domain-informed ERM or standard pooling ERM performs better.

Abstract: Domain generalization (DG) is the problem of generalizing from several distributions (or domains), for which labeled training data are available, to a new test domain for which no labeled data is available. A common finding in the DG literature is that it is difficult to outperform empirical risk minimization (ERM) on the pooled training data. In this work, we argue that this finding has primarily been reported for datasets satisfying a \emph{covariate shift} assumption. When the dataset satisfies a \emph{posterior drift} assumption instead, we show that ``domain-informed ERM,’’ wherein feature vectors are augmented with domain-specific information, outperforms pooling ERM. These claims are supported by a theoretical framework and experiments on language and vision tasks.

[884] Forking-Sequences

Willa Potosnak, Malcolm Wolff, Boris Oreshkin, Mengfei Cao, Michael W. Mahoney, Dmitry Efimov, Kin G. Olivares

Main category: cs.LG

TL;DR: Forking-sequences method improves forecast stability across forecast creation dates by jointly encoding/decoding time series across all dates, achieving 8.8-37.9% stability improvements across various neural architectures.

DetailsMotivation: Standard forecasting models treat each forecast creation date independently, leading to erratic revisions that undermine stakeholder trust and disrupt downstream decision-making, despite high accuracy.

Method: Forking-sequences approach jointly encodes and decodes the entire time series across all forecast creation dates, mirroring time series cross-validation, unlike standard methods that treat each date independently.

Result: Validated on 16 datasets from M1, M3, M4, and Tourism competitions, showing forecast percentage change stability improvements of 28.8%, 28.8%, 37.9%, 31.3%, and 8.8% on average for MLP, RNN, LSTM, CNN, and Transformer architectures respectively.

Conclusion: Forking-sequences provides three key benefits: more stable gradient updates during training, reduced forecast variance through ensembling, and improved inference computational efficiency, making a strong case for broader adoption in neural forecasting.

Abstract: While accuracy is a critical requirement for time series forecasting models, an equally important (yet often overlooked) desideratum is forecast stability across forecast creation dates (FCDs). Even highly accurate models can produce erratic revisions between FCDs, undermining stakeholder trust and disrupting downstream decision-making. To improve forecast stability, models like MQCNN, MQT, and SPADE employ a little-known but highly effective technique: forking-sequences. Unlike standard statistical and neural forecasting methods that treat each FCD independently, the forking-sequences method jointly encodes and decodes the entire time series across all FCDs, in a way mirroring time series cross-validation. Since forking sequences remains largely unknown in the broader neural forecasting community, in this work, we formalize the forking-sequences approach, and we make a case for its broader adoption. We demonstrate three key benefits of forking-sequences: (i) more stable and consistent gradient updates during training; (ii) reduced forecast variance through ensembling; and (iii) improved inference computational efficiency. We validate forking-sequences’ benefits using 16 datasets from the M1, M3, M4, and Tourism competitions, showing improvements in forecast percentage change stability of 28.8%, 28.8%, 37.9%, and 31.3%, and 8.8%, on average, for MLP, RNN, LSTM, CNN, and Transformer-based architectures, respectively.

[885] Expand Neurons, Not Parameters

Linghao Kong, Inimai Subramanian, Yonadav Shavit, Micah Adler, Dan Alistarh, Nir Shavit

Main category: cs.LG

TL;DR: Fixed Parameter Expansion (FPE) improves network performance by increasing neuron count without adding parameters, reducing feature interference through weight partitioning.

DetailsMotivation: To improve neural network performance by reducing feature entanglement and interference while maintaining the same number of non-zero parameters, addressing the bottleneck of memory movement in modern accelerators.

Method: Introduce Fixed Parameter Expansion (FPE): replace neurons with multiple children that inherit non-overlapping subsets of parent weights, creating disjoint connections without increasing parameter count.

Result: FPE systematically reduces polysemanticity metrics and increases accuracy on symbolic tasks, with benefits growing when interference is high. Similar gains achieved with random weight splits, indicating reduced collisions as primary driver.

Conclusion: Widening networks while maintaining constant non-zero parameters consistently improves performance by leveraging width against superposition, providing an interpretability-grounded mechanism well-suited for modern hardware bottlenecks.

Abstract: This work demonstrates how increasing the number of neurons in a network without increasing its number of non-zero parameters improves performance. We show that this gain corresponds with a decrease in interference between multiple features that would otherwise share the same neurons. To reduce such entanglement at a fixed non-zero parameter count, we introduce Fixed Parameter Expansion (FPE): replace a neuron with multiple children and partition the parent’s weights disjointly across them, so that each child inherits a non-overlapping subset of connections. On symbolic tasks, specifically Boolean code problems, clause-aligned FPE systematically reduces polysemanticity metrics and yields higher task accuracy. Notably, random splits of neuron weights approximate these gains, indicating that reduced collisions, not precise assignment, are a primary driver. Consistent with the superposition hypothesis, the benefits of FPE grow with increasing interference: when polysemantic load is high, accuracy improvements are the largest. Transferring these insights to real models (classifiers over CLIP embeddings and deeper multilayer networks) we find that widening networks while maintaining a constant non-zero parameter count consistently increases accuracy. These results identify an interpretability-grounded mechanism to leverage width against superposition, improving performance without increasing the number of non-zero parameters. Such a direction is well matched to modern accelerators, where memory movement of non-zero parameters, rather than raw compute, is the dominant bottleneck.

[886] Predictive Feature Caching for Training-free Acceleration of Molecular Geometry Generation

Johanna Sommer, John Rachwan, Nils Fleischmann, Stephan Günnemann, Bertrand Charpentier

Main category: cs.LG

TL;DR: A training-free caching method accelerates molecular geometry generation by predicting intermediate hidden states across solver steps, achieving 2-3x speedup with minimal quality loss.

DetailsMotivation: Flow matching models generate high-quality molecular geometries but have high computational costs during inference, requiring hundreds of network evaluations, which becomes a bottleneck when sampling large numbers of molecules.

Method: Proposes a training-free caching strategy that predicts intermediate hidden states across solver steps, operates directly on SE(3)-equivariant backbone, is compatible with pretrained models, and orthogonal to existing training-based accelerations.

Result: On GEOM-Drugs dataset, achieves 2x reduction in wall-clock inference time at matched sample quality, up to 3x speedup compared to base model with minimal quality degradation, and up to 7x speedup when combined with other optimizations.

Conclusion: The caching method effectively accelerates molecular geometry generation without requiring retraining, providing significant speedups that compound with other optimization techniques.

Abstract: Flow matching models generate high-fidelity molecular geometries but incur significant computational costs during inference, requiring hundreds of network evaluations. This inference overhead becomes the primary bottleneck when such models are employed in practice to sample large numbers of molecular candidates. This work discusses a training-free caching strategy that accelerates molecular geometry generation by predicting intermediate hidden states across solver steps. The proposed method operates directly on the SE(3)-equivariant backbone, is compatible with pretrained models, and is orthogonal to existing training-based accelerations and system-level optimizations. Experiments on the GEOM-Drugs dataset demonstrate that caching achieves a twofold reduction in wall-clock inference time at matched sample quality and a speedup of up to 3x compared to the base model with minimal sample quality degradation. Because these gains compound with other optimizations, applying caching alongside other general, lossless optimizations yield as much as a 7x speedup.

[887] Wavelet Predictive Representations for Non-Stationary Reinforcement Learning

Min Wang, Xin Li, Ye He, Yao-Hui Li, Hasnaa Bennis, Riashat Islam, Mingzhong Wang

Main category: cs.LG

TL;DR: WISDOM is a novel NSRL method that uses wavelet analysis to capture multi-scale features in evolving MDP sequences, enhancing agent adaptability in non-stationary environments through wavelet-domain task representations and a wavelet TD update operator.

DetailsMotivation: Real-world environments are inherently non-stationary with changing factors like weather and traffic flows. Existing NSRL approaches focus on regularly evolving patterns and have limited adaptability in highly dynamic settings.

Method: Proposes WISDOM which transforms task representation sequences into the wavelet domain to capture multi-scale features. Uses wavelet coefficients to represent global trends and fine-grained variations. Introduces a wavelet temporal difference (TD) update operator for enhanced tracking and prediction of MDP evolution.

Result: Experiments on diverse benchmarks show WISDOM significantly outperforms existing baselines in both sample efficiency and asymptotic performance. Demonstrates remarkable adaptability in complex non-stationary environments with stochastically evolving tasks.

Conclusion: WISDOM effectively enhances NSRL by leveraging wavelet analysis to capture multi-scale environmental dynamics, providing superior adaptability in non-stationary settings compared to existing approaches.

Abstract: The real world is inherently non-stationary, with ever-changing factors, such as weather conditions and traffic flows, making it challenging for agents to adapt to varying environmental dynamics. Non-Stationary Reinforcement Learning (NSRL) addresses this challenge by training agents to adapt rapidly to sequences of distinct Markov Decision Processes (MDPs). However, existing NSRL approaches often focus on tasks with regularly evolving patterns, leading to limited adaptability in highly dynamic settings. Inspired by the success of Wavelet analysis in time series modeling, specifically its ability to capture signal trends at multiple scales, we propose WISDOM to leverage wavelet-domain predictive task representations to enhance NSRL. WISDOM captures these multi-scale features in evolving MDP sequences by transforming task representation sequences into the wavelet domain, where wavelet coefficients represent both global trends and fine-grained variations of non-stationary changes. In addition to the auto-regressive modeling commonly employed in time series forecasting, we devise a wavelet temporal difference (TD) update operator to enhance tracking and prediction of MDP evolution. We theoretically prove the convergence of this operator and demonstrate policy improvement with wavelet task representations. Experiments on diverse benchmarks show that WISDOM significantly outperforms existing baselines in both sample efficiency and asymptotic performance, demonstrating its remarkable adaptability in complex environments characterized by non-stationary and stochastically evolving tasks.

[888] Noise or Signal? Deconstructing Contradictions and An Adaptive Remedy for Reversible Normalization in Time Series Forecasting

Fanzhe Fu, Yang Yang

Main category: cs.LG

TL;DR: Replacing RevIN’s statistics with robust counterparts (R²-IN) seems straightforward but reveals complex performance issues. While R²-IN prevents catastrophic failures on outlier datasets, adaptive models (A-IN) suffer complete systemic failure, showing that heuristic instability can be more damaging than the statistical problems they aim to solve.

DetailsMotivation: To understand why simple improvements to Reversible Instance Normalization (RevIN) don't work as expected, and to investigate the complex performance of various normalization strategies in time series forecasting.

Method: Deconstructed normalization strategies by identifying four theoretical contradictions and conducted experiments comparing standard RevIN, robust R²-IN, and adaptive A-IN models on datasets with extreme outliers.

Result: Standard RevIN catastrophically fails on datasets with extreme outliers (683% MSE surge), R²-IN unexpectedly emerges as best overall performer, while adaptive A-IN suffers complete systemic failure despite diagnostic-driven design.

Conclusion: Proposes a cautionary paradigm shift: from blind complexity search to diagnostics-driven analysis that reveals both the power of simple baselines and the perilous nature of naive adaptation in time series normalization.

Abstract: Reversible Instance Normalization (RevIN) is a key technique enabling simple linear models to achieve state-of-the-art performance in time series forecasting. While replacing its non-robust statistics with robust counterparts (termed R$^2$-IN) seems like a straightforward improvement, our findings reveal a far more complex reality. This paper deconstructs the perplexing performance of various normalization strategies by identifying four underlying theoretical contradictions. Our experiments provide two crucial findings: first, the standard RevIN catastrophically fails on datasets with extreme outliers, where its MSE surges by a staggering 683%. Second, while the simple R$^2$-IN prevents this failure and unexpectedly emerges as the best overall performer, our adaptive model (A-IN), designed to test a diagnostics-driven heuristic, unexpectedly suffers a complete and systemic failure. This surprising outcome uncovers a critical, overlooked pitfall in time series analysis: the instability introduced by a simple or counter-intuitive heuristic can be more damaging than the statistical issues it aims to solve. The core contribution of this work is thus a new, cautionary paradigm for time series normalization: a shift from a blind search for complexity to a diagnostics-driven analysis that reveals not only the surprising power of simple baselines but also the perilous nature of naive adaptation.

[889] Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion

Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji

Main category: cs.LG

TL;DR: The paper introduces a theoretical analysis of MaskGIT samplers for masked diffusion models, revealing their implicit temperature sampling mechanism, and proposes a more efficient “moment sampler” with partial caching and hybrid adaptive unmasking techniques.

DetailsMotivation: Masked diffusion models show promising performance but their sampling process acceleration remains underexplored, creating a need for more efficient samplers.

Method: Theoretical analysis of MaskGIT sampler, introduction of “moment sampler” with choose-then-sample approach, partial caching for transformers, and hybrid adaptive unmasking strategy.

Result: Experiments in image and text domains demonstrate improved efficiency and validate the theoretical understanding of masked diffusion samplers.

Conclusion: The work advances both theoretical understanding and practical implementation of masked diffusion samplers through efficient sampling methods.

Abstract: Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we introduce the “moment sampler,” an asymptotically equivalent but more tractable and interpretable alternative to MaskGIT, which employs a “choose-then-sample” approach by selecting unmasking positions before sampling tokens. In addition, we improve the efficiency of choose-then-sample algorithms through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains demonstrate our theory as well as the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.

[890] Semantic Channel Equalization Strategies for Deep Joint Source-Channel Coding

Lorenzo Pannacci, Simone Fiorellino, Mario Edoardo Pandolfo, Emilio Calvanese Strinati, Paolo Di Lorenzo

Main category: cs.LG

TL;DR: This paper addresses semantic noise in DeepJSCC systems caused by mismatched latent spaces in multi-vendor deployments, proposing three semantic channel equalization methods to align heterogeneous latent spaces.

DetailsMotivation: Existing DeepJSCC schemes assume shared latent spaces between transmitter and receiver, which fails in multi-vendor deployments where encoders and decoders cannot be co-trained, leading to semantic noise and degraded performance.

Method: The paper introduces semantic channel equalization with three aligner types: linear maps (closed-form solutions), lightweight neural networks (greater expressiveness), and Parseval-frame equalizer (zero-shot operation without training).

Result: Extensive experiments on image reconstruction over AWGN and fading channels quantify trade-offs among complexity, data efficiency, and fidelity.

Conclusion: The study provides guidelines for deploying DeepJSCC in heterogeneous AI-native wireless networks by systematizing semantic channel equalization methods.

Abstract: Deep joint source-channel coding (DeepJSCC) has emerged as a powerful paradigm for end-to-end semantic communications, jointly learning to compress and protect task-relevant features over noisy channels. However, existing DeepJSCC schemes assume a shared latent space at transmitter (TX) and receiver (RX) - an assumption that fails in multi-vendor deployments where encoders and decoders cannot be co-trained. This mismatch introduces “semantic noise”, degrading reconstruction quality and downstream task performance. In this paper, we systematize and evaluate methods for semantic channel equalization for DeepJSCC, introducing an additional processing stage that aligns heterogeneous latent spaces under both physical and semantic impairments. We investigate three classes of aligners: (i) linear maps, which admit closed-form solutions; (ii) lightweight neural networks, offering greater expressiveness; and (iii) a Parseval-frame equalizer, which operates in zero-shot mode without the need for training. Through extensive experiments on image reconstruction over AWGN and fading channels, we quantify trade-offs among complexity, data efficiency, and fidelity, providing guidelines for deploying DeepJSCC in heterogeneous AI-native wireless networks.

[891] Graph-based Tabular Deep Learning Should Learn Feature Interactions, Not Just Make Predictions

Elias Dubbeldam, Reza Mohammadi, Marit Schoonhoven, S. Ilker Birbil

Main category: cs.LG

TL;DR: Graph-based tabular deep learning methods should prioritize learning and evaluating feature interaction structures rather than just prediction accuracy, as this improves both performance and interpretability.

DetailsMotivation: Deep learning for tabular data underperforms compared to tree-based models due to inability to model complex feature interactions. Existing graph-based methods focus on prediction but neglect accurate graph structure modeling.

Method: Analyzed existing GTDL methods using synthetic datasets with known ground-truth graph structures to evaluate their ability to recover meaningful feature interactions.

Result: Existing GTDL methods fail to recover meaningful feature interactions. Enforcing true interaction structure improves predictive performance.

Conclusion: GTDL should shift toward structure-aware modeling that prioritizes quantitative evaluation and accurate structural learning for better interpretability, trustworthiness, and domain understanding.

Abstract: Despite recent progress, deep learning methods for tabular data still struggle to compete with traditional tree-based models. A key challenge lies in modeling complex, dataset-specific feature interactions that are central to tabular data. Graph-based tabular deep learning (GTDL) methods aim to address this by representing features and their interactions as graphs. However, existing methods predominantly optimize predictive accuracy, neglecting accurate modeling of the graph structure. This position paper argues that GTDL should move beyond prediction-centric objectives and prioritize the explicit learning and evaluation of feature interactions. Using synthetic datasets with known ground-truth graph structures, we show that existing GTDL methods fail to recover meaningful feature interactions. Moreover, enforcing the true interaction structure improves predictive performance. This highlights the need for GTDL methods to prioritize quantitative evaluation and accurate structural learning. We call for a shift toward structure-aware modeling as a foundation for building GTDL systems that are not only accurate but also interpretable, trustworthy, and grounded in domain understanding.

[892] How does the optimizer implicitly bias the model merging loss landscape?

Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw

Main category: cs.LG

TL;DR: Model merging effectiveness depends on the effective noise scale during training, with an optimal non-monotonic relationship. Training hyperparameters like learning rate, weight decay, batch size, and data augmentation modulate this noise scale.

DetailsMotivation: To understand what properties make model merging effective, particularly how optimization affects loss landscape geometry and merging success.

Method: Analyzed how effective noise scale unifies the impact of optimizer and data choices on merging. Examined learning rates, weight decay, batch sizes, and data augmentation across architectures and datasets.

Result: Found that merging success is a non-monotonic function of effective noise with a distinct optimum. Larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all increase effective noise and follow the same qualitative trend.

Conclusion: Optimizer noise affects global loss landscape geometry and predicts when independently trained solutions can be merged, suggesting potential for manipulating training dynamics to improve merging effectiveness.

Abstract: Model merging methods combine models with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which linearly interpolates between model weights, and task arithmetic, which combines task vectors obtained by the difference between finetuned and base models. While useful in practice, what properties make merging effective are poorly understood. This paper explores how the optimization process affects the loss landscape geometry and its impact on merging success. We show that a single quantity – the effective noise scale – unifies the impact of optimizer and data choices on model merging. Across architectures and datasets, the effectiveness of merging success is a non-monotonic function of effective noise, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale, exhibiting the same qualitative trend. Unlike prior work that connects optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its downstream consequences for model merging, suggesting the possibility of further manipulating the training dynamics to improve merging effectiveness.

[893] Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF–QP Safety Layer in Arbitrage-Free Markets

Jian’an Zhang

Main category: cs.LG

TL;DR: Tail-Safe is a derivatives hedging framework combining distributional RL with a safety layer using control barrier functions and quadratic programming to enforce financial constraints while managing tail risk.

DetailsMotivation: To create a deployable derivatives hedging system that addresses left-tail risk while maintaining safety constraints and auditability for financial governance.

Method: Combines IQN-based distributional critic with CVaR objective (IQN-CVaR-PPO) and a Tail-Coverage Controller for quantile sampling. Uses CBF-QP safety layer with financial constraints like no-trade bands, limits, and sign-consistency gates.

Result: Improves left-tail risk without degrading central performance, achieves zero hard-constraint violations when QP is feasible, and provides auditable telemetry for governance. Tested in synthetic markets with microstructure-aware execution.

Conclusion: Tail-Safe provides a robust framework for safe derivatives hedging with strong theoretical guarantees and practical deployability, though limited by reliance on synthetic data and simplified execution models.

Abstract: We introduce Tail-Safe, a deployability-oriented framework for derivatives hedging that unifies distributional, risk-sensitive reinforcement learning with a white-box control-barrier-function (CBF) quadratic-program (QP) safety layer tailored to financial constraints. The learning component combines an IQN-based distributional critic with a CVaR objective (IQN–CVaR–PPO) and a Tail-Coverage Controller that regulates quantile sampling through temperature tilting and tail boosting to stabilize small-$\alpha$ estimation. The safety component enforces discrete-time CBF inequalities together with domain-specific constraints – ellipsoidal no-trade bands, box and rate limits, and a sign-consistency gate – solved as a convex QP whose telemetry (active sets, tightness, rate utilization, gate scores, slack, and solver status) forms an auditable trail for governance. We provide guarantees of robust forward invariance of the safe set under bounded model mismatch, a minimal-deviation projection interpretation of the QP, a KL-to-DRO upper bound linking per-state KL regularization to worst-case CVaR, concentration and sample-complexity results for the temperature-tilted CVaR estimator, and a CVaR trust-region improvement inequality under KL limits, together with feasibility persistence under expiry-aware tightening. Empirically, in arbitrage-free, microstructure-aware synthetic markets (SSVI $\to$ Dupire $\to$ VIX with ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading central performance and yields zero hard-constraint violations whenever the QP is feasible with zero slack. Telemetry is mapped to governance dashboards and incident workflows to support explainability and auditability. Limitations include reliance on synthetic data and simplified execution to isolate methodological contributions.

[894] Challenger-Based Combinatorial Bandits for Subcarrier Selection in OFDM Systems

Mohsen Amiri, V Venktesh, Sindri Magnússon

Main category: cs.LG

TL;DR: The paper presents a gap-index framework for efficiently identifying top-m user-scheduling sets in multi-user MIMO downlink systems, treating it as a combinatorial pure-exploration problem in stochastic linear bandits.

DetailsMotivation: Exhaustive search is infeasible due to exponential growth of action space in multi-user MIMO downlink scheduling, requiring efficient methods for online, measurement-efficient subcarrier selection in AI-enabled communication systems.

Method: Adopts linear utility model and introduces gap-index framework with champion-challenger shortlists, focusing measurements on most informative comparisons to reduce runtime and computation.

Result: Significant reductions in runtime and computation compared to state-of-the-art linear bandit methods, with high identification accuracy and tunable trade-off between speed and accuracy.

Conclusion: Shortlist-driven pure exploration makes online, measurement-efficient subcarrier selection practical for AI-enabled communication systems, as demonstrated in realistic OFDM downlink simulations.

Abstract: This paper investigates the identification of the top-m user-scheduling sets in multi-user MIMO downlink, which is cast as a combinatorial pure-exploration problem in stochastic linear bandits. Because the action space grows exponentially, exhaustive search is infeasible. We therefore adopt a linear utility model to enable efficient exploration and reliable selection of promising user subsets. We introduce a gap-index framework that maintains a shortlist of current estimates of champion arms (top-m sets) and a rotating shortlist of challenger arms that pose the greatest threat to the champions. This design focuses on measurements that yield the most informative gap-index-based comparisons, resulting in significant reductions in runtime and computation compared to state-of-the-art linear bandit methods, with high identification accuracy. The method also exposes a tunable trade-off between speed and accuracy. Simulations on a realistic OFDM downlink show that shortlist-driven pure exploration makes online, measurement-efficient subcarrier selection practical for AI-enabled communication systems.

[895] Stochastic Approximation Methods for Distortion Risk Measure Optimization

Jinyang Jiang, Bernd Heidergott, Jiaqiao Hu, Yijie Peng

Main category: cs.LG

TL;DR: This paper proposes gradient descent algorithms for Distortion Risk Measures optimization using dual representations, with convergence proofs and applications in portfolio selection and inventory management.

DetailsMotivation: Distortion Risk Measures capture risk preferences in decision-making under uncertainty, but efficient optimization methods are needed for practical applications.

Method: Two gradient descent algorithms: DM-form (three-timescale with quantile tracking) and QF-form (simpler two-timescale), plus a hybrid approach combining both.

Result: DM-form achieves O(k^{-4/7}) convergence rate, QF-form achieves faster O(k^{-2/3}) rate. Numerical experiments show substantial improvements in portfolio selection and successful integration with deep reinforcement learning.

Conclusion: The proposed algorithms provide efficient optimization for Distortion Risk Measures with proven convergence, demonstrating practical applicability in financial and inventory management problems.

Abstract: Distortion Risk Measures (DRMs) capture risk preferences in decision-making and serve as general criteria for managing uncertainty. This paper proposes gradient descent algorithms for DRM optimization based on two dual representations: the Distortion-Measure (DM) form and Quantile-Function (QF) form. The DM-form employs a three-timescale algorithm to track quantiles, compute their gradients, and update decision variables, utilizing the Generalized Likelihood Ratio and kernel-based density estimation. The QF-form provides a simpler two-timescale approach that avoids the need for complex quantile gradient estimation. A hybrid form integrates both approaches, applying the DM-form for robust performance around distortion function jumps and the QF-form for efficiency in smooth regions. Proofs of strong convergence and convergence rates for the proposed algorithms are provided. In particular, the DM-form achieves an optimal rate of $O(k^{-4/7})$, while the QF-form attains a faster rate of $O(k^{-2/3})$. Numerical experiments confirm their effectiveness and demonstrate substantial improvements over baselines in robust portfolio selection tasks. The method’s scalability is further illustrated through integration into deep reinforcement learning. Specifically, a DRM-based Proximal Policy Optimization algorithm is developed and applied to multi-echelon dynamic inventory management, showcasing its practical applicability.

[896] Busemann Functions in the Wasserstein Space: Existence, Closed-Forms, and Applications to Slicing

Clément Bonet, Elsa Cazelles, Lucas Drumetz, Nicolas Courty

Main category: cs.LG

TL;DR: The paper studies Busemann functions in Wasserstein space, providing closed-form expressions for one-dimensional distributions and Gaussian measures, enabling explicit projection schemes and novel Sliced-Wasserstein distances.

DetailsMotivation: Busemann functions are important for geometric machine learning as they define projections onto geodesic rays and generalize hyperplanes. Since data can be modeled as probability distributions, studying them in Wasserstein space (with its Riemannian structure from Optimal Transport) is natural.

Method: Investigate existence and computation of Busemann functions in Wasserstein space, establish closed-form expressions for one-dimensional distributions and Gaussian measures, develop explicit projection schemes for probability distributions on ℝ.

Result: Successfully derived closed-form expressions for Busemann functions in two important cases, enabling explicit projection schemes and novel Sliced-Wasserstein distances for Gaussian mixtures and labeled datasets.

Conclusion: The proposed methods demonstrate efficiency on synthetic datasets and transfer learning problems, providing practical tools for geometric machine learning applications in Wasserstein space.

Abstract: The Busemann function has recently found much interest in a variety of geometric machine learning problems, as it naturally defines projections onto geodesic rays of Riemannian manifolds and generalizes the notion of hyperplanes. As several sources of data can be conveniently modeled as probability distributions, it is natural to study this function in the Wasserstein space, which carries a rich formal Riemannian structure induced by Optimal Transport metrics. In this work, we investigate the existence and computation of Busemann functions in Wasserstein space, which admits geodesic rays. We establish closed-form expressions in two important cases: one-dimensional distributions and Gaussian measures. These results enable explicit projection schemes for probability distributions on $\mathbb{R}$, which in turn allow us to define novel Sliced-Wasserstein distances over Gaussian mixtures and labeled datasets. We demonstrate the efficiency of those original schemes on synthetic datasets as well as transfer learning problems.

[897] Improved probabilistic regression using diffusion models

Carlo Kneissl, Christopher Bülte, Philipp Scholl, Gitta Kutyniok

Main category: cs.LG

TL;DR: A diffusion-based framework for probabilistic regression that models predictive distributions nonparametrically by learning the full distribution of diffusion noise, enabling uncertainty quantification across diverse regression tasks.

DetailsMotivation: Existing diffusion models lack uncertainty evaluation in general regression tasks and are limited to domain-specific applications, while probabilistic regression offers richer insights than point estimates through uncertainty quantification.

Method: Proposes a diffusion-based framework that models the full distribution of diffusion noise with different noise parameterizations, allowing nonparametric learning of predictive distributions for probabilistic regression.

Result: Superior performance against existing baselines across various regression tasks (low- and high-dimensional), with calibrated uncertainty estimates demonstrating the framework’s versatility.

Conclusion: The proposed diffusion-based probabilistic regression framework effectively provides uncertainty quantification and shows strong performance across diverse regression settings, making it a versatile tool for probabilistic prediction.

Abstract: Probabilistic regression models the entire predictive distribution of a response variable, offering richer insights than classical point estimates and directly allowing for uncertainty quantification. While diffusion-based generative models have shown remarkable success in generating complex, high-dimensional data, their usage in general regression tasks often lacks uncertainty-related evaluation and remains limited to domain-specific applications. We propose a novel diffusion-based framework for probabilistic regression that learns predictive distributions in a nonparametric way. More specifically, we propose to model the full distribution of the diffusion noise, enabling adaptation to diverse tasks and enhanced uncertainty quantification. We investigate different noise parameterizations, analyze their trade-offs, and evaluate our framework across a broad range of regression tasks, covering low- and high-dimensional settings. For several experiments, our approach shows superior performance against existing baselines, while delivering calibrated uncertainty estimates, demonstrating its versatility as a tool for probabilistic prediction.

[898] Closed-Form Last Layer Optimization

Alexandre Galashov, Nathaël Da Costa, Liyuan Xu, Philipp Hennig, Arthur Gretton

Main category: cs.LG

TL;DR: A method that optimizes neural networks by treating the last layer as a function of backbone parameters and using closed-form solutions for the last layer weights during training, with proven convergence guarantees.

DetailsMotivation: Neural networks are typically optimized with SGD variants, but for squared loss, the optimal solution for linear last layer weights can be computed in closed-form, suggesting a more efficient optimization approach.

Method: Treat the last layer as a function of backbone parameters and optimize only for backbone parameters, which is equivalent to alternating between gradient descent on backbone and closed-form updates on last layer. Adapted for SGD by balancing current batch loss with accumulated information from previous batches.

Result: Proven convergence to optimal solution in Neural Tangent Kernel regime. Demonstrated effectiveness compared to standard SGD on squared loss in supervised tasks including regression, classification, Fourier Neural Operators, and Instrumental Variable Regression.

Conclusion: The proposed method leveraging closed-form solutions for last layer weights provides an effective alternative to standard SGD with proven convergence guarantees and improved performance on various supervised learning tasks.

Abstract: Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal solution to the linear last layer weights is known in closed-form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters, and optimizing solely for these parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method for the setting of stochastic gradient descent, by trading off the loss on the current batch against the accumulated information from previous batches. Further, we prove that, in the Neural Tangent Kernel regime, convergence of this method to an optimal solution is guaranteed. Finally, we demonstrate the effectiveness of our approach compared with standard SGD on a squared loss in several supervised tasks – both regression and classification – including Fourier Neural Operators and Instrumental Variable Regression.

[899] Forecasting-Based Biomedical Time-series Data Synthesis for Open Data and Robust AI

Youngjoon Lee, Seongmin Cho, Yehhyun Jo, Jinu Gong, Hyunjoo Jenny Lee, Joonhyuk Kang

Main category: cs.LG

TL;DR: A framework for generating synthetic biomedical time-series data using advanced forecasting models to address data scarcity and privacy issues in AI development.

DetailsMotivation: Limited data availability due to privacy regulations and resource constraints creates a critical gap for biomedical time-series AI development.

Method: Propose a framework based on advanced forecasting models that generates synthetic biomedical time-series data replicating complex electrophysiological signals like EEG and EMG.

Result: Synthetic datasets preserve essential temporal and spectral properties of real data, enable robust analysis, and significantly boost AI model performance across multiple subjects.

Conclusion: The approach effectively addresses data scarcity and privacy challenges while maintaining critical biomedical features, providing high scalability, and integrating with open-source repositories to expand resources for AI-driven biomedical research.

Abstract: The limited data availability due to strict privacy regulations and significant resource demands severely constrains biomedical time-series AI development, which creates a critical gap between data requirements and accessibility. Synthetic data generation presents a promising solution by producing artificial datasets that maintain the statistical properties of real biomedical time-series data without compromising patient confidentiality. We propose a framework for synthetic biomedical time-series data generation based on advanced forecasting models that accurately replicates complex electrophysiological signals such as EEG and EMG with high fidelity. These synthetic datasets preserve essential temporal and spectral properties of real data, which enables robust analysis while effectively addressing data scarcity and privacy challenges. Our evaluations across multiple subjects demonstrate that the generated synthetic data can serve as an effective substitute for real data and also significantly boost AI model performance. The approach maintains critical biomedical features while provides high scalability for various applications and integrates seamlessly into open-source repositories, substantially expanding resources for AI-driven biomedical research.

[900] When Do Credal Sets Stabilize? Fixed-Point Theorems for Credal Set Updates

Michele Caprio, Siu Lun Chau, Krikamol Muandet

Main category: cs.LG

TL;DR: Analysis of iterative learning convergence in imprecise probabilistic machine learning using credal sets, with application to Credal Bayesian Deep Learning.

DetailsMotivation: To understand whether iterative updates in imprecise probabilistic learning converge to stable fixed points and under what conditions such convergence occurs.

Method: Theoretical analysis of iterative update rules on credal sets (closed convex sets of probability distributions) in imprecise probabilistic machine learning.

Result: Identifies structural conditions under which stability emerges in iterative learning processes with imprecision.

Conclusion: Incorporating imprecision enriches uncertainty representation and reveals conditions for learning stability, providing new insights into iterative learning dynamics under imprecision.

Abstract: Many machine learning algorithms rely on iterative updates of uncertainty representations, ranging from variational inference and expectation-maximization, to reinforcement learning, continual learning, and multi-agent learning. In the presence of imprecision and ambiguity, credal sets – closed, convex sets of probability distributions – have emerged as a popular framework for representing imprecise probabilistic beliefs. Under such imprecision, many learning problems in imprecise probabilistic machine learning (IPML) may be viewed as processes involving successive applications of update rules on credal sets. This naturally raises the question of whether this iterative process converges to stable fixed points – or, more generally, under what conditions on the updating mechanism such fixed points exist, and whether they can be attained. We provide the first analysis of this problem and illustrate our findings using Credal Bayesian Deep Learning as a concrete example. Our work demonstrates that incorporating imprecision into the learning process not only enriches the representation of uncertainty, but also reveals structural conditions under which stability emerges, thereby offering new insights into the dynamics of iterative learning under imprecision.

[901] Compressed Concatenation of Small Embedding Models

Mohamed Ayoub Ben Ayad, Michael Dinzinger, Kanishka Ghosh Dastidar, Jelena Mitrovic, Michael Granitzer

Main category: cs.LG

TL;DR: Concatenating multiple small embedding models outperforms single larger models, and a lightweight decoder with MRL loss compresses the high-dimensional joint representation while preserving performance.

DetailsMotivation: Small embedding models are practical for resource-constrained environments but underperform compared to larger models. The goal is to bridge this performance gap while maintaining deployability.

Method: Concatenate raw embedding vectors from multiple small models, then use a lightweight unified decoder trained with Matryoshka Representation Learning (MRL) loss to compress the high-dimensional joint representation to a low-dimensional space without fine-tuning base models.

Result: On MTEB retrieval tasks, the concat-encode-quantize pipeline recovers 89% of original performance with 48x compression when applied to concatenation of four small embedding models. Concatenating more models yields diminishing gains but improves robustness under compression and quantization.

Conclusion: Concatenating small embedding models with a lightweight decoder enables high performance in resource-constrained environments, achieving significant compression while maintaining most of the original retrieval performance.

Abstract: Embedding models are central to dense retrieval, semantic search, and recommendation systems, but their size often makes them impractical to deploy in resource-constrained environments such as browsers or edge devices. While smaller embedding models offer practical advantages, they typically underperform compared to their larger counterparts. To bridge this gap, we demonstrate that concatenating the raw embedding vectors of multiple small models can outperform a single larger baseline on standard retrieval benchmarks. To overcome the resulting high dimensionality of naive concatenation, we introduce a lightweight unified decoder trained with a Matryoshka Representation Learning (MRL) loss. This decoder maps the high-dimensional joint representation to a low-dimensional space, preserving most of the original performance without fine-tuning the base models. We also show that while concatenating more base models yields diminishing gains, the robustness of the decoder’s representation under compression and quantization improves. Our experiments show that, on a subset of MTEB retrieval tasks, our concat-encode-quantize pipeline recovers 89% of the original performance with a 48x compression factor when the pipeline is applied to a concatenation of four small embedding models.

[902] Distribution Preference Optimization: A Fine-grained Perspective for LLM Unlearning

Kai Qin, Jiaqi Wu, Jianxiang He, Haoyuan Sun, Yifei Zhao, Bin Liang, Yongzhe Chang, Tiantian Zhang, Houde Liu

Main category: cs.LG

TL;DR: DiPO is a novel LLM unlearning method that operates at the distribution level by targeting next-token probability distributions, overcoming limitations of previous methods like NPO that lack explicit positive preference signals.

DetailsMotivation: Address limitations of existing LLM unlearning methods like Negative Preference Optimization (NPO), which lack explicit positive preference signals and require domain-specific knowledge for constructing preferred responses, restricting generalizability.

Method: DiPO shifts focus to distribution-level unlearning by targeting next-token probability distributions. It constructs preference distribution pairs by selectively amplifying or suppressing the model’s high-confidence output logits, providing explicit preference signals without domain-specific knowledge.

Result: DiPO achieves strong trade-off between model utility and forget quality. It attains highest forget quality on TOFU benchmark and maintains leading scalability and sustainability in utility preservation on MUSE benchmark.

Conclusion: DiPO effectively overcomes NPO’s limitations by operating at distribution level and provides a theoretically consistent unlearning algorithm that achieves superior performance in both forgetting quality and utility preservation.

Abstract: As Large Language Models (LLMs) demonstrate remarkable capabilities learned from vast corpora, concerns regarding data privacy and safety are receiving increasing attention. LLM unlearning, which aims to remove the influence of specific data while preserving overall model utility, is becoming an important research area. One of the mainstream unlearning classes is optimization-based methods, which achieve forgetting directly through fine-tuning, exemplified by Negative Preference Optimization (NPO). However, NPO’s effectiveness is limited by its inherent lack of explicit positive preference signals. Attempts to introduce such signals by constructing preferred responses often necessitate domain-specific knowledge or well-designed prompts, fundamentally restricting their generalizability. In this paper, we shift the focus to the distribution-level, directly targeting the next-token probability distribution instead of entire responses, and derive a novel unlearning algorithm termed \textbf{Di}stribution \textbf{P}reference \textbf{O}ptimization (DiPO). We show that the requisite preference distribution pairs for DiPO, which are distributions over the model’s output tokens, can be constructed by selectively amplifying or suppressing the model’s high-confidence output logits, thereby effectively overcoming NPO’s limitations. We theoretically prove the consistency of DiPO’s loss function with the desired unlearning direction. Extensive experiments demonstrate that DiPO achieves a strong trade-off between model utility and forget quality. Notably, DiPO attains the highest forget quality on the TOFU benchmark, and maintains leading scalability and sustainability in utility preservation on the MUSE benchmark.

[903] IMLP: An Energy-Efficient Continual Learning Method for Tabular Data Streams

Yuandou Wang, Filip Gunnarsson, Rihan Hai

Main category: cs.LG

TL;DR: IMLP is a compact continual learning model for tabular data streams that uses attention over a sliding latent feature buffer to achieve constant memory usage and high energy efficiency while maintaining competitive accuracy.

DetailsMotivation: Tabular data streams are increasingly used in resource-constrained environments like edge devices, but existing continual learning solutions rely on replay buffers that grow over time and consume excessive resources.

Method: Proposes IMLP with windowed scaled dot-product attention over a sliding latent feature buffer, avoiding raw data storage and using shared feed-forward layers for lightweight per-segment updates.

Result: IMLP achieves up to 27.6× higher energy efficiency than TabNet and 85.5× higher than TabPFN while maintaining competitive average accuracy.

Conclusion: IMLP provides an easy-to-deploy, energy-efficient alternative to full retraining for tabular data streams on resource-constrained devices.

Abstract: Tabular data streams are rapidly emerging as a dominant modality for real-time decision-making in healthcare, finance, and the Internet of Things (IoT). These applications commonly run on edge and mobile devices, where energy budgets, memory, and compute are strictly limited. Continual learning (CL) addresses such dynamics by training models sequentially on task streams while preserving prior knowledge and consolidating new knowledge. While recent CL work has advanced in mitigating catastrophic forgetting and improving knowledge transfer, the practical requirements of energy and memory efficiency for tabular data streams remain underexplored. In particular, existing CL solutions mostly depend on replay mechanisms whose buffers grow over time and exacerbate resource costs. We propose a context-aware incremental Multi-Layer Perceptron (IMLP), a compact continual learner for tabular data streams. IMLP incorporates a windowed scaled dot-product attention over a sliding latent feature buffer, enabling constant-size memory and avoiding storing raw data. The attended context is concatenated with current features and processed by shared feed-forward layers, yielding lightweight per-segment updates. To assess practical deployability, we introduce NetScore-T, a tunable metric coupling balanced accuracy with energy for Pareto-aware comparison across models and datasets. IMLP achieves up to $27.6\times$ higher energy efficiency than TabNet and $85.5\times$ higher than TabPFN, while maintaining competitive average accuracy. Overall, IMLP provides an easy-to-deploy, energy-efficient alternative to full retraining for tabular data streams.

[904] Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt

Main category: cs.LG

TL;DR: TTC-RL is an agent that automatically creates task-specific curricula for reinforcement learning during test-time, improving model performance on target tasks without human data curation.

DetailsMotivation: To enable models to learn on the job like humans, automatically selecting relevant training data and continuing training during test-time to improve performance on specific tasks.

Method: Uses test-time curriculum (TTC-RL) that automatically selects task-relevant data from large training pools and applies reinforcement learning for continued training on target tasks.

Result: Significant performance improvements: Qwen3-8B pass@1 increased by 1.8x on AIME25 and 2.1x on CodeElo; pass@8 increased from 40% to 62% on AIME25 and from 28% to 43% on CodeElo.

Conclusion: Test-time curricula show strong potential for extending test-time scaling to continual training on thousands of task-relevant experiences, significantly raising performance ceilings across various models and evaluations.

Abstract: Humans are good at learning on the job: We learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and applies reinforcement learning to continue training the model for its target task. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula in extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences during test-time.

[905] Counterfactual Credit Guided Bayesian Optimization

Qiyu Wei, Haowei Wang, Richard Allmendinger, Mauricio A. Álvarez

Main category: cs.LG

TL;DR: CCGBO is a Bayesian optimization framework that uses counterfactual credit to quantify individual historical observations’ contributions, enabling selective resource allocation for faster global optimum discovery.

DetailsMotivation: Traditional Bayesian optimization focuses on building global surrogates, but in practice, the goal is often to quickly find the global optimum. Not all observations equally contribute to optimum discovery due to the sequential nature and model dependency.

Method: Introduces counterfactual credit to quantify individual historical observations’ contributions and incorporates this into the acquisition function for selective resource allocation.

Result: CCGBO achieves sublinear regret and empirical evaluations show it consistently reduces simple regret and accelerates convergence to the global optimum across synthetic and real-world benchmarks.

Conclusion: CCGBO provides an effective framework for Bayesian optimization that explicitly addresses the unequal contribution of observations to optimum discovery, leading to faster convergence while maintaining theoretical guarantees.

Abstract: Bayesian optimization has emerged as a prominent methodology for optimizing expensive black-box functions by leveraging Gaussian process surrogates, which focus on capturing the global characteristics of the objective function. However, in numerous practical scenarios, the primary objective is not to construct an exhaustive global surrogate, but rather to quickly pinpoint the global optimum. Due to the aleatoric nature of the sequential optimization problem and its dependence on the quality of the surrogate model and the initial design, it is restrictive to assume that all observed samples contribute equally to the discovery of the optimum in this context. In this paper, we introduce Counterfactual Credit Guided Bayesian Optimization (CCGBO), a novel framework that explicitly quantifies the contribution of individual historical observations through counterfactual credit. By incorporating counterfactual credit into the acquisition function, our approach can selectively allocate resources in areas where optimal solutions are most likely to occur. We prove that CCGBO retains sublinear regret. Empirical evaluations on various synthetic and real-world benchmarks demonstrate that CCGBO consistently reduces simple regret and accelerates convergence to the global optimum.

[906] On Predicting Post-Click Conversion Rate via Counterfactual Inference

Junhyung Ahn, Sanghack Lee

Main category: cs.LG

TL;DR: ESCIM uses causal inference to generate counterfactual conversion labels for non-clicked samples in CVR prediction, addressing data sparsity issues in recommendation systems.

DetailsMotivation: CVR prediction models traditionally use only clicked samples due to conversion dependency on clicks, leading to data sparsity and bias issues. Recent approaches using non-clicked samples rely on heuristics rather than principled methods.

Method: Train structural causal model of user behaviors, conduct hypothetical click interventions on non-clicked items to infer counterfactual CVRs, transform predicted CVRs to binary labels, and incorporate generated samples into training.

Result: Extensive experiments on public datasets show superiority over existing methods. Online A/B testing validates effectiveness in real-world scenarios. Method demonstrates improved performance on latent conversion data with robust generalization.

Conclusion: ESCIM provides a principled causal inference approach to leverage non-clicked samples for CVR prediction, effectively addressing data sparsity and bias issues while maintaining strong generalization capabilities.

Abstract: Accurately predicting conversion rate (CVR) is essential in various recommendation domains such as online advertising systems and e-commerce. These systems utilize user interaction logs, which consist of exposures, clicks, and conversions. CVR prediction models are typically trained solely based on clicked samples, as conversions can only be determined following clicks. However, the sparsity of clicked instances necessitates the collection of a substantial amount of logs for effective model training. Recent works address this issue by devising frameworks that leverage non-clicked samples. While these frameworks aim to reduce biases caused by the discrepancy between clicked and non-clicked samples, they often rely on heuristics. Against this background, we propose a method to counterfactually generate conversion labels for non-clicked samples by using causality as a guiding principle, attempting to answer the question, “Would the user have converted if he or she had clicked the recommended item?” Our approach is named the Entire Space Counterfactual Inference Multi-task Model (ESCIM). We initially train a structural causal model (SCM) of user sequential behaviors and conduct a hypothetical intervention (i.e., click) on non-clicked items to infer counterfactual CVRs. We then introduce several approaches to transform predicted counterfactual CVRs into binary counterfactual conversion labels for the non-clicked samples. Finally, the generated samples are incorporated into the training process. Extensive experiments on public datasets illustrate the superiority of the proposed algorithm. Online A/B testing further empirically validates the effectiveness of our proposed algorithm in real-world scenarios. In addition, we demonstrate the improved performance of the proposed method on latent conversion data, showcasing its robustness and superior generalization capabilities.

[907] Parameter-free Algorithms for the Stochastically Extended Adversarial Model

Shuche Wang, Adarsh Barik, Peng Zhao, Vincent Y. F. Tan

Main category: cs.LG

TL;DR: First parameter-free algorithms for Stochastically Extended Adversarial (SEA) model that eliminate need for prior knowledge of domain diameter and Lipschitz constant.

DetailsMotivation: Existing SEA model approaches require prior knowledge of problem-specific parameters (domain diameter D and Lipschitz constant G), limiting practical applicability.

Method: Leverage Optimistic Online Newton Step (OONS) algorithm to develop parameter-free methods. First establish comparator-adaptive algorithm for unknown domain diameter, then extend to unknown both D and G.

Result: Achieve expected regret bound of Õ(‖u‖₂² + ‖u‖₂(√σ²₁:ₜ + √Σ²₁:ₜ)) where u is comparator vector, σ²₁:ₜ is cumulative stochastic variance, Σ²₁:ₜ is cumulative adversarial variation.

Conclusion: Proposed methods demonstrate efficacy even when both parameters are unknown in SEA model, with regret bound maintaining same dependence on stochastic and adversarial variations.

Abstract: We develop the first parameter-free algorithms for the Stochastically Extended Adversarial (SEA) model, a framework that bridges adversarial and stochastic online convex optimization. Existing approaches for the SEA model require prior knowledge of problem-specific parameters, such as the diameter of the domain $D$ and the Lipschitz constant of the loss functions $G$, which limits their practical applicability. Addressing this, we develop parameter-free methods by leveraging the Optimistic Online Newton Step (OONS) algorithm to eliminate the need for these parameters. We first establish a comparator-adaptive algorithm for the scenario with unknown domain diameter but known Lipschitz constant, achieving an expected regret bound of $\tilde{O}\big(|u|2^2 + |u|2(\sqrt{\sigma^2{1:T}} + \sqrt{\Sigma^2{1:T}})\big)$, where $u$ is the comparator vector and $\sigma^2_{1:T}$ and $\Sigma^2_{1:T}$ represent the cumulative stochastic variance and cumulative adversarial variation, respectively. We then extend this to the more general setting where both $D$ and $G$ are unknown, attaining the comparator- and Lipschitz-adaptive algorithm. Notably, the regret bound exhibits the same dependence on $\sigma^2_{1:T}$ and $\Sigma^2_{1:T}$, demonstrating the efficacy of our proposed methods even when both parameters are unknown in the SEA model.

[908] Bond-Centered Molecular Fingerprint Derivatives: A BBBP Dataset Study

Guillaume Godin

Main category: cs.LG

TL;DR: BCFP is a bond-centered fingerprint that complements ECFP for BBBP classification. Combining ECFP with BCFP improves performance over using either alone, with r=1 radius performing best. A new BCFP-Sort&Slice method enables efficient feature combination while preserving OOV information.

DetailsMotivation: To create a bond-centric alternative to atom-centered ECFP fingerprints that can complement existing methods for molecular property prediction, specifically for Brain-Blood Barrier Penetration classification.

Method: Introduced static BCFP mirroring bond-convolution from directed message-passing GNNs, evaluated with Random Forest on BBBP task. Proposed BCFP-Sort&Slice for feature combination preserving OOV count information.

Result: Concatenating ECFP with BCFP consistently improved AUROC and AUPRC over individual descriptors. r=1 radius performed best, with r=2 not providing statistically significant gains. Outperformed MGTP prediction on BBBP evaluation.

Conclusion: Lightweight bond-centered descriptors effectively complement atom-centered fingerprints and provide strong, fast baselines for BBBP prediction.

Abstract: Bond Centered FingerPrint (BCFP) are a complementary, bond-centric alternative to Extended-Connectivity Fingerprints (ECFP). We introduce a static BCFP that mirrors the bond-convolution used by directed message-passing GNNs like ChemProp, and evaluate it with a fast rapid Random Forest model on Brain-Blood Barrier Penetration (BBBP) classification task. Across stratified cross-validation, concatenating ECFP with BCFP consistently improves AUROC and AUPRC over either descriptor alone, as confirmed by Turkey HSD multiple-comparison analysis. Among radii, r = 1 performs best; r = 2 does not yield statistically separable gains under the same test. We further propose BCFP-Sort&Slice, a simple feature-combination scheme that preserves the out-of-vocabulary (OOV) count information native to ECFP count vectors while enabling compact unhashed concatenation of BCFP variants. We also outperform the MGTP prediction on our BBBP evaluation, using such composite new features bond and atom features. These results show that lightweight, bond-centered descriptors can complement atom-centered circular fingerprints and provide strong, fast baselines for BBBP prediction.

[909] ViTs: Teaching Machines to See Time Series Anomalies Like Human Experts

Zexin Wang, Changhua Pei, Yang Liu, Hengyue Jiang, Quan Zhou, Haotian Si, Hang Cui, Jianhui Li, Gaogang Xie, Jingjing Li, Dan Pei

Main category: cs.LG

TL;DR: ViTs is a Vision-Language Model framework that converts time series into visual representations to enable flexible anomaly detection across varying sequence lengths without retraining, overcoming LLM context limitations.

DetailsMotivation: To achieve 'train once, infer across scenarios' for time series anomaly detection, addressing limitations of conventional sliding-window methods and LLM context constraints when handling sequences from 1 hour to 1 week.

Method: Convert time series curves into visual representations via rescaling to preserve temporal dependencies while maintaining consistent input size. Use evolutionary algorithm to generate image-text pairs and implement three-stage training: time series knowledge injection, anomaly detection enhancement, and anomaly reasoning refinement.

Result: Extensive experiments show ViTs substantially enhances VLM ability to understand and detect anomalies in time series data, enabling efficient processing of arbitrarily long sequences without context constraints.

Conclusion: The ViTs framework successfully addresses key challenges in time series anomaly detection by leveraging visual representations and VLMs, providing flexible inference across varying sequence lengths while maintaining detection performance.

Abstract: Web service administrators must ensure the stability of multiple systems by promptly detecting anomalies in Key Performance Indicators (KPIs). Achieving the goal of “train once, infer across scenarios” remains a fundamental challenge for time series anomaly detection models. Beyond improving zero-shot generalization, such models must also flexibly handle sequences of varying lengths during inference, ranging from one hour to one week, without retraining. Conventional approaches rely on sliding-window encoding and self-supervised learning, which restrict inference to fixed-length inputs. Large Language Models (LLMs) have demonstrated remarkable zero-shot capabilities across general domains. However, when applied to time series data, they face inherent limitations due to context length. To address this issue, we propose ViTs, a Vision-Language Model (VLM)-based framework that converts time series curves into visual representations. By rescaling time series images, temporal dependencies are preserved while maintaining a consistent input size, thereby enabling efficient processing of arbitrarily long sequences without context constraints. Training VLMs for this purpose introduces unique challenges, primarily due to the scarcity of aligned time series image-text data. To overcome this, we employ an evolutionary algorithm to automatically generate thousands of high-quality image-text pairs and design a three-stage training pipeline consisting of: (1) time series knowledge injection, (2) anomaly detection enhancement, and (3) anomaly reasoning refinement. Extensive experiments demonstrate that ViTs substantially enhance the ability of VLMs to understand and detect anomalies in time series data. All datasets and code will be publicly released at: https://anonymous.4open.science/r/ViTs-C484/.

[910] Distributionally Robust Causal Abstractions

Yorgos Felekis, Theodoros Damoulas, Paris Giampouras

Main category: cs.LG

TL;DR: This paper introduces the first distributionally robust causal abstraction learning framework that addresses limitations of existing methods by handling environmental shifts and model misspecification through constrained min-max optimization with Wasserstein ambiguity sets.

DetailsMotivation: Existing causal abstraction learning methods assume fixed and well-specified exogenous distributions, making them vulnerable to environmental shifts and misspecification. The authors aim to overcome these limitations by developing a robust framework.

Method: Proposes distributionally robust causal abstractions using constrained min-max optimization with Wasserstein ambiguity sets. Provides theoretical results for both empirical and Gaussian environments, with principled selection of robustness levels via ambiguity set radii.

Result: Empirical evidence across different problems and causal abstraction learning methods demonstrates the framework’s robustness to environmental shifts, structural model misspecification, and intervention mapping misspecification.

Conclusion: The introduced distributionally robust causal abstraction framework successfully addresses key limitations of existing methods by providing robustness against environmental shifts and various types of model misspecification.

Abstract: Causal Abstraction (CA) theory provides a principled framework for relating causal models that describe the same system at different levels of granularity while ensuring interventional consistency between them. Recently, several approaches for learning CAs have been proposed, but all assume fixed and well-specified exogenous distributions, making them vulnerable to environmental shifts and misspecification. In this work, we address these limitations by introducing the first class of distributionally robust CAs and their associated learning algorithms. The latter cast robust causal abstraction learning as a constrained min-max optimization problem with Wasserstein ambiguity sets. We provide theoretical results, for both empirical and Gaussian environments, leading to principled selection of the level of robustness via the radius of these sets. Furthermore, we present empirical evidence across different problems and CA learning methods, demonstrating our framework’s robustness not only to environmental shifts but also to structural model and intervention mapping misspecification.

[911] Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs

Emanuele Mule, Stefano Fiorini, Antonio Purificato, Federico Siciliano, Stefano Coniglio, Fabrizio Silvestri

Main category: cs.LG

TL;DR: DSHN introduces a framework combining sheaf theory with directed hypergraphs, creating a complex-valued Laplacian operator that improves performance on heterophilic data.

DetailsMotivation: Directed hypergraphs can model oriented group interactions but remain under-explored. Existing approaches have homophily bias and lack proper treatment of asymmetric relations in hypergraphs.

Method: Developed Directional Sheaf Hypergraph Networks (DSHN) by integrating sheaf theory with principled treatment of asymmetric relations, creating the Directed Sheaf Hypergraph Laplacian.

Result: DSHN achieves relative accuracy gains from 2% to 20% across 7 real-world datasets against 13 baselines.

Conclusion: A principled treatment of directionality in hypergraphs combined with sheaf theory substantially improves performance, especially in heterophilic settings.

Abstract: Hypergraphs provide a natural way to represent higher-order interactions among multiple entities. While undirected hypergraphs have been extensively studied, the case of directed hypergraphs, which can model oriented group interactions, remains largely under-explored despite its relevance for many applications. Recent approaches in this direction often exhibit an implicit bias toward homophily, which limits their effectiveness in heterophilic settings. Rooted in the algebraic topology notion of Cellular Sheaves, Sheaf Neural Networks (SNNs) were introduced as an effective solution to circumvent such a drawback. While a generalization to hypergraphs is known, it is only suitable for undirected hypergraphs, failing to tackle the directed case. In this work, we introduce Directional Sheaf Hypergraph Networks (DSHN), a framework integrating sheaf theory with a principled treatment of asymmetric relations within a hypergraph. From it, we construct the Directed Sheaf Hypergraph Laplacian, a complex-valued operator by which we unify and generalize many existing Laplacian matrices proposed in the graph- and hypergraph-learning literature. Across 7 real-world datasets and against 13 baselines, DSHN achieves relative accuracy gains from 2% up to 20%, showing how a principled treatment of directionality in hypergraphs, combined with the expressive power of sheaves, can substantially improve performance.

[912] EVaR-Optimal Arm Identification in Bandits

Mehrasa Ahmadipour, Aurélien Garivier

Main category: cs.LG

TL;DR: This paper studies fixed-confidence best arm identification in multi-armed bandits using the Entropic Value-at-Risk criterion, proposing an asymptotically optimal algorithm for risk-averse decision-making.

DetailsMotivation: Addresses the need for risk-averse decision-making in high-stakes environments like finance, moving beyond simple expected value optimization to account for risk using EVaR.

Method: Proposes a δ-correct, Track-and-Stop based algorithm that requires solving complex convex and simpler non-convex optimization problems.

Result: Derives a lower bound on expected sample complexity and proves the algorithm asymptotically matches this bound.

Conclusion: The proposed algorithm provides an asymptotically optimal solution for risk-averse best arm identification under EVaR in nonparametric settings with general reward distributions.

Abstract: We study the fixed-confidence best arm identification (BAI) problem within the multi-armed bandit (MAB) framework under the Entropic Value-at-Risk (EVaR) criterion. Our analysis considers a nonparametric setting, allowing for general reward distributions bounded in [0,1]. This formulation addresses the critical need for risk-averse decision-making in high-stakes environments, such as finance, moving beyond simple expected value optimization. We propose a $\delta$-correct, Track-and-Stop based algorithm and derive a corresponding lower bound on the expected sample complexity, which we prove is asymptotically matched. The implementation of our algorithm and the characterization of the lower bound both require solving a complex convex optimization problem and a related, simpler non-convex one.

[913] Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao

Main category: cs.LG

TL;DR: The paper identifies Alignment Tipping Process (ATP) - a post-deployment risk where self-evolving LLM agents gradually abandon their alignment constraints through real-world interaction, leading to collective misalignment.

DetailsMotivation: As LLM agents gain self-evolutionary capabilities to adapt through real-world interaction, their long-term reliability becomes critical. The paper aims to address the unique post-deployment risk of alignment decay that differs from training-time failures.

Method: The authors formalize ATP through two paradigms: Self-Interested Exploration (individual behavioral drift from repeated high-reward deviations) and Imitative Strategy Diffusion (deviant behaviors spreading in multi-agent systems). They construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct models.

Result: Experiments show alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Current RL-based alignment methods provide only fragile defenses against ATP.

Conclusion: Alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. The findings highlight the need for more robust alignment methods that can withstand long-term self-evolution.

Abstract: As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.

[914] Provable Affine Identifiability of Nonlinear CCA under Latent Distributional Priors

Zhiwei Han, Stefan Matthes, Hao Shen

Main category: cs.LG

TL;DR: Nonlinear CCA recovers ground-truth latent factors up to orthogonal transform after whitening, with identifiability guarantees for broad latent distributions and finite-sample convergence.

DetailsMotivation: To establish conditions for nonlinear CCA to recover true latent factors, addressing identifiability in the population setting and extending to finite samples.

Method: Prove affine identifiability by reparameterizing from observation to source space, use whitening for boundedness/well-conditioning, and show ridge-regularized empirical CCA converges to population CCA.

Result: Theoretical guarantees for latent factor recovery up to orthogonal transform, validated on synthetic and image datasets with systematic ablations.

Conclusion: Whitening is essential for identifiability in nonlinear CCA, with proven convergence from empirical to population settings and experimental validation.

Abstract: In this work, we establish conditions under which nonlinear CCA recovers the ground-truth latent factors up to an orthogonal transform after whitening. Building on the classical result that linear mappings maximize canonical correlations under Gaussian priors, we prove affine identifiability for a broad class of latent distributions in the population setting. Central to our proof is a reparameterization result that transports the analysis from observation space to source space, where identifiability becomes tractable. We further show that whitening is essential for ensuring boundedness and well-conditioning, thereby underpinning identifiability. Beyond the population setting, we prove that ridge-regularized empirical CCA converges to its population counterpart, transferring these guarantees to the finite-sample regime. Experiments on a controlled synthetic dataset and a rendered image dataset validate our theory and demonstrate the necessity of its assumptions through systematic ablations.

[915] ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs

Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjae Lee, Yuchen Zeng, Shuibai Zhang, Coleman Hooper, Yuezhou Hu, Hyung Il Koo, Nam Ik Cho, Kangwook Lee

Main category: cs.LG

TL;DR: Diffusion LLMs (dLLMs) promise faster inference through parallel decoding, but suffer quality degradation due to ignoring token dependencies. The paper introduces ParallelBench to evaluate this issue and reveals dLLMs struggle with real-world tasks that require token dependencies.

DetailsMotivation: To address the overlooked quality degradation in dLLMs caused by parallel decoding's conditional independence assumption, which ignores token dependencies essential for many tasks.

Method: Conducted information-theoretic analysis and synthetic list operation case studies, then created ParallelBench - a specialized benchmark with realistic tasks challenging for dLLMs under parallel decoding.

Result: dLLMs under parallel decoding suffer dramatic quality degradation in real-world scenarios, and current strategies fail to adapt parallelism based on task difficulty, unable to achieve meaningful speedup without quality loss.

Conclusion: There’s an urgent need for innovative decoding methods to overcome the speed-quality trade-off in dLLMs, and ParallelBench is released to accelerate development of truly efficient dLLMs.

Abstract: While most autoregressive LLMs are constrained to one-by-one decoding, diffusion LLMs (dLLMs) have attracted growing interest for their potential to dramatically accelerate inference through parallel decoding. Despite this promise, the conditional independence assumption in dLLMs causes parallel decoding to ignore token dependencies, inevitably degrading generation quality when these dependencies are strong. However, existing works largely overlook these inherent challenges, and evaluations on standard benchmarks (e.g., math and coding) are not sufficient to capture the quality degradation caused by parallel decoding. To address this gap, we first provide an information-theoretic analysis of parallel decoding. We then conduct case studies on analytically tractable synthetic list operations from both data distribution and decoding strategy perspectives, offering quantitative insights that highlight the fundamental limitations of parallel decoding. Building on these insights, we propose ParallelBench, the first benchmark specifically designed for dLLMs, featuring realistic tasks that are trivial for humans and autoregressive LLMs yet exceptionally challenging for dLLMs under parallel decoding. Using ParallelBench, we systematically analyze both dLLMs and autoregressive LLMs, revealing that: (i) dLLMs under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speedup without compromising quality. Our findings underscore the pressing need for innovative decoding methods that can overcome the current speed-quality trade-off. We release our benchmark to help accelerate the development of truly efficient dLLMs.

[916] Less is More: Recursive Reasoning with Tiny Networks

Alexia Jolicoeur-Martineau

Main category: cs.LG

TL;DR: Tiny Recursive Model (TRM) is a simpler recursive reasoning approach that outperforms Hierarchical Reasoning Model (HRM) and most LLMs on hard puzzle tasks like ARC-AGI, using only 7M parameters and achieving 45% test-accuracy on ARC-AGI-1.

DetailsMotivation: HRM shows promise for solving hard problems with small networks but is not well understood and may be suboptimal. The authors aim to develop a simpler and more effective recursive reasoning approach.

Method: TRM uses a much simpler recursive reasoning approach with a single tiny network containing only 2 layers and 7M parameters, trained on small datasets.

Result: TRM achieves 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, outperforming most LLMs (Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of their parameters.

Conclusion: TRM demonstrates significantly higher generalization than HRM while using a simpler architecture, proving that tiny recursive models can effectively solve hard reasoning tasks with minimal parameters.

Abstract: Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.

[917] MetaMP: Seamless Metadata Enrichment and AI Application Framework for Enhanced Membrane Protein Visualization and Analysis

Ebenezer Awotoro, Chisom Ezekannagha, Florian Schwarz, Johannes Tauscher, Dominik Heider, Katharina Ladewig, Christel Le Bon, Karine Moncoq, Bruno Miroux, Georges Hattab

Main category: cs.LG

TL;DR: MetaMP is a unified framework that integrates membrane protein databases using machine learning for classification, improving data quality and providing interactive exploration tools with high accuracy in structure classification and discrepancy resolution.

DetailsMotivation: The complexity of membrane protein structures, coupled with challenges like missing data, inconsistencies, and computational barriers from disparate sources, creates a need for improved database integration in structural biology.

Method: MetaMP unifies membrane-protein databases within a web application using machine learning for classification. It enriches metadata, offers a user-friendly interface with eight interactive views, and supports functions like structure classification and outlier detection.

Result: MetaMP resolved 77% of data discrepancies and accurately predicted the class of newly identified membrane proteins 98% of the time, outperforming expert curation. It was effective across tasks of varying difficulty without compromising speed or accuracy.

Conclusion: MetaMP is a valuable resource that harmonizes current knowledge and enables AI-driven exploration of membrane-protein architecture, demonstrating practical applications in predicting transmembrane segments, reconciling legacy databases, and classifying structures with explainable AI support.

Abstract: Structural biology has made significant progress in determining membrane proteins, leading to a remarkable increase in the number of available structures in dedicated databases. The inherent complexity of membrane protein structures, coupled with challenges such as missing data, inconsistencies, and computational barriers from disparate sources, underscores the need for improved database integration. To address this gap, we present MetaMP, a framework that unifies membrane-protein databases within a web application and uses machine learning for classification. MetaMP improves data quality by enriching metadata, offering a user-friendly interface, and providing eight interactive views for streamlined exploration. MetaMP was effective across tasks of varying difficulty, demonstrating advantages across different levels without compromising speed or accuracy, according to user evaluations. Moreover, MetaMP supports essential functions such as structure classification and outlier detection. We present three practical applications of Artificial Intelligence (AI) in membrane protein research: predicting transmembrane segments, reconciling legacy databases, and classifying structures with explainable AI support. In a validation focused on statistics, MetaMP resolved 77% of data discrepancies and accurately predicted the class of newly identified membrane proteins 98% of the time and overtook expert curation. Altogether, MetaMP is a much-needed resource that harmonizes current knowledge and empowers AI-driven exploration of membrane-protein architecture.

[918] Revealing Interconnections between Diseases: from Statistical Methods to Large Language Models

Alina Ermilova, Dmitrii Kornilov, Sofia Samoilova, Ekaterina Laptenkova, Anastasia Kolesnikova, Ekaterina Podplutova, Senotrusova Sofya, Maksim G. Sharaev

Main category: cs.LG

TL;DR: Systematic evaluation of 7 approaches for discovering disease relationships using EHR data and ICD-10 codes shows LLMs produce less diverse interconnections than statistical and domain-specific methods, suggesting limited potential for discovering new disease connections.

DetailsMotivation: Manual analysis of clinical data for disease interconnections is labor-intensive and subjective, while machine learning faces challenges in method selection, data source reliability, and lack of ground truth for unexplored disease relationships.

Method: Evaluated 7 approaches: statistical co-occurrence analysis, masked language modeling, domain-specific BERT variants (Med-BERT, BioClinicalBERT), general-purpose BERT with document retrieval, and 4 LLMs (Mistral, DeepSeek, Qwen, YandexGPT) using MIMIC-IV EHR data and ICD-10 codes with/without descriptions.

Result: LLM-based approaches produced interconnections with the lowest diversity of ICD code connections compared to other methods, indicating limited potential for discovering new disease interconnections.

Conclusion: In absence of ground truth databases, the results provide a valuable medical disease ontology that can serve as a foundational resource for future clinical research and AI applications in healthcare.

Abstract: Identifying disease interconnections through manual analysis of large-scale clinical data is labor-intensive, subjective, and prone to expert disagreement. While machine learning (ML) shows promise, three critical challenges remain: (1) selecting optimal methods from the vast ML landscape, (2) determining whether real-world clinical data (e.g., electronic health records, EHRs) or structured disease descriptions yield more reliable insights, (3) the lack of “ground truth,” as some disease interconnections remain unexplored in medicine. Large language models (LLMs) demonstrate broad utility, yet they often lack specialized medical knowledge. To address these gaps, we conduct a systematic evaluation of seven approaches for uncovering disease relationships based on two data sources: (i) sequences of ICD-10 codes from MIMIC-IV EHRs and (ii) the full set of ICD-10 codes, both with and without textual descriptions. Our framework integrates the following: (i) a statistical co-occurrence analysis and a masked language modeling (MLM) approach using real clinical data; (ii) domain-specific BERT variants (Med-BERT and BioClinicalBERT); (iii) a general-purpose BERT and document retrieval; and (iv) four LLMs (Mistral, DeepSeek, Qwen, and YandexGPT). Our graph-based comparison of the obtained interconnection matrices shows that the LLM-based approach produces interconnections with the lowest diversity of ICD code connections to different diseases compared to other methods, including text-based and domain-based approaches. This suggests an important implication: LLMs have limited potential for discovering new interconnections. In the absence of ground truth databases for medical interconnections between ICD codes, our results constitute a valuable medical disease ontology that can serve as a foundational resource for future clinical research and artificial intelligence applications in healthcare.

[919] On the Hardness of Learning Regular Expressions

Idan Attias, Lev Reyzin, Nathan Srebro, Gal Vardi

Main category: cs.LG

TL;DR: Learning regular expressions is computationally hard in both PAC model and with membership queries, even under uniform distribution. Hardness extends to expressions with complement or intersection.

DetailsMotivation: Despite theoretical importance and practical use of regular expressions, computational complexity of learning them remains largely unexplored.

Method: Analyze computational hardness of improperly learning regular expressions in PAC model and with membership queries under various distributions.

Result: PAC learning is hard even under uniform distribution; distribution-free learning with membership queries is hard; learning expressions with complement/intersection is hard even under uniform distribution.

Conclusion: Regular expression learning is computationally hard in multiple settings, and these hardness results are distinct from existing DFA/NFA learning hardness due to exponential descriptive complexity differences.

Abstract: Despite the theoretical significance and wide practical use of regular expressions, the computational complexity of learning them has been largely unexplored. We study the computational hardness of improperly learning regular expressions in the PAC model and with membership queries. We show that PAC learning is hard even under the uniform distribution on the hypercube, and also prove hardness of distribution-free learning with membership queries. Furthermore, if regular expressions are extended with complement or intersection, we establish hardness of learning with membership queries even under the uniform distribution. We emphasize that these results do not follow from existing hardness results for learning DFAs or NFAs, since the descriptive complexity of regular languages can differ exponentially between DFAs, NFAs, and regular expressions.

[920] Synthesising Counterfactual Explanations via Label-Conditional Gaussian Mixture Variational Autoencoders

Junqi Jiang, Francesco Leofante, Antonio Rago, Francesca Toni

Main category: cs.LG

TL;DR: Proposes LAPACE, a model-agnostic framework for generating robust counterfactual explanations using latent path interpolation to Gaussian centroids in a structured latent space.

DetailsMotivation: Existing counterfactual explanation methods struggle to simultaneously address robustness against perturbations, plausibility (data manifold adherence), and diversity of recourse options in a unified manner.

Method: Introduces L-GMVAE to learn structured latent space with Gaussian components per class, then LAPACE algorithm synthesizes counterfactual paths by interpolating from input latent representations to learned centroids.

Result: LAPACE achieves computational efficiency and competitive performance across eight quantitative metrics, providing robust paths that converge to fixed centroids for input robustness.

Conclusion: The proposed framework successfully addresses multiple requirements for counterfactual explanations - robustness, plausibility, diversity - in a unified, model-agnostic approach with efficient gradient-based constraint incorporation.

Abstract: Counterfactual explanations (CEs) provide recourse recommendations for individuals affected by algorithmic decisions. A key challenge is generating CEs that are robust against various perturbation types (e.g. input and model perturbations) while simultaneously satisfying other desirable properties. These include plausibility, ensuring CEs reside on the data manifold, and diversity, providing multiple distinct recourse options for single inputs. Existing methods, however, mostly struggle to address these multifaceted requirements in a unified, model-agnostic manner. We address these limitations by proposing a novel generative framework. First, we introduce the Label-conditional Gaussian Mixture Variational Autoencoder (L-GMVAE), a model trained to learn a structured latent space where each class label is represented by a set of Gaussian components with diverse, prototypical centroids. Building on this, we present LAPACE (LAtent PAth Counterfactual Explanations), a model-agnostic algorithm that synthesises entire paths of CE points by interpolating from inputs’ latent representations to those learned latent centroids. This approach inherently ensures robustness to input changes, as all paths for a given target class converge to the same fixed centroids. Furthermore, the generated paths provide a spectrum of recourse options, allowing users to navigate the trade-off between proximity and plausibility while also encouraging robustness against model changes. In addition, user-specified actionability constraints can also be easily incorporated via lightweight gradient optimisation through the L-GMVAE’s decoder. Comprehensive experiments show that LAPACE is computationally efficient and achieves competitive performance across eight quantitative metrics.

[921] Focused Skill Discovery: Learning to Control Specific State Variables while Minimizing Side Effects

Jonathan Colaço Carr, Qinyi Sun, Cameron Allen

Main category: cs.LG

TL;DR: A method for learning focused skills that control specific state variables in reinforcement learning, improving exploration efficiency and avoiding negative side effects.

DetailsMotivation: Existing skill discovery algorithms overlook natural state variables, leading to inefficient exploration, difficult skill learning, and negative side effects in downstream tasks.

Method: Introduces a general method that enables skill discovery algorithms to learn focused skills that target and control specific state variables.

Result: Improves state space coverage by a factor of three, unlocks new learning capabilities, and automatically avoids negative side effects in downstream tasks.

Conclusion: The approach successfully addresses limitations of existing skill discovery methods by enabling focused skill learning that controls specific state variables.

Abstract: Skills are essential for unlocking higher levels of problem solving. A common approach to discovering these skills is to learn ones that reliably reach different states, thus empowering the agent to control its environment. However, existing skill discovery algorithms often overlook the natural state variables present in many reinforcement learning problems, meaning that the discovered skills lack control of specific state variables. This can significantly hamper exploration efficiency, make skills more challenging to learn with, and lead to negative side effects in downstream tasks when the goal is under-specified. We introduce a general method that enables these skill discovery algorithms to learn focused skills – skills that target and control specific state variables. Our approach improves state space coverage by a factor of three, unlocks new learning capabilities, and automatically avoids negative side effects in downstream tasks.

[922] A Clinical-grade Universal Foundation Model for Intraoperative Pathology

Zihan Zhao, Fengtao Zhou, Ronggang Li, Bing Chu, Xinke Zhang, Xueyi Zheng, Ke Zheng, Xiaobo Wen, Jiabo Ma, Yihui Wang, Jiewei Chen, Chengyou Zheng, Jiangyu Zhang, Yongqin Wen, Jiajia Meng, Ziqi Zeng, Xiaoqing Li, Jing Li, Dan Xie, Yaping Ye, Yu Wang, Hao Chen, Muyan Cai

Main category: cs.LG

TL;DR: CRISP is a clinical-grade foundation model for intraoperative pathology that achieved high diagnostic accuracy in both retrospective and prospective validation, demonstrating robust generalization across institutions and tumor types while reducing diagnostic workload and improving surgical decision-making.

DetailsMotivation: Intraoperative pathology is crucial for precision surgery but faces challenges due to diagnostic complexity and limited high-quality frozen-section data. Computational pathology has advanced but lacks large-scale prospective validation for routine clinical adoption.

Method: Developed CRISP foundation model on over 100,000 frozen sections from eight medical centers. Evaluated on more than 15,000 intraoperative slides across nearly 100 diagnostic tasks including benign-malignant discrimination, intraoperative decision-making, and pan-cancer detection.

Result: Model demonstrated robust generalization across diverse institutions, tumor types, and anatomical sites. In prospective cohort of 2,000+ patients, sustained high diagnostic accuracy under real-world conditions, directly informed surgical decisions in 92.6% of cases. Human-AI collaboration reduced diagnostic workload by 35%, avoided 105 ancillary tests, and enhanced micrometastases detection with 87.5% accuracy.

Conclusion: CRISP represents a clinical-grade paradigm for AI-driven intraoperative pathology that bridges computational advances with surgical precision, accelerating AI translation into routine clinical practice.

Abstract: Intraoperative pathology is pivotal to precision surgery, yet its clinical impact is constrained by diagnostic complexity and the limited availability of high-quality frozen-section data. While computational pathology has made significant strides, the lack of large-scale, prospective validation has impeded its routine adoption in surgical workflows. Here, we introduce CRISP, a clinical-grade foundation model developed on over 100,000 frozen sections from eight medical centers, specifically designed to provide Clinical-grade Robust Intraoperative Support for Pathology (CRISP). CRISP was comprehensively evaluated on more than 15,000 intraoperative slides across nearly 100 retrospective diagnostic tasks, including benign-malignant discrimination, key intraoperative decision-making, and pan-cancer detection, etc. The model demonstrated robust generalization across diverse institutions, tumor types, and anatomical sites-including previously unseen sites and rare cancers. In a prospective cohort of over 2,000 patients, CRISP sustained high diagnostic accuracy under real-world conditions, directly informing surgical decisions in 92.6% of cases. Human-AI collaboration further reduced diagnostic workload by 35%, avoided 105 ancillary tests and enhanced detection of micrometastases with 87.5% accuracy. Together, these findings position CRISP as a clinical-grade paradigm for AI-driven intraoperative pathology, bridging computational advances with surgical precision and accelerating the translation of artificial intelligence into routine clinical practice.

[923] Glocal Information Bottleneck for Time Series Imputation

Jie Yang, Kexin Zhang, Guibin Zhang, Philip S. Yu, Kaize Ding

Main category: cs.LG

TL;DR: Proposes Glocal-IB, a model-agnostic training paradigm that addresses the optimization dilemma in time series imputation under high missing rates by aligning latent representations to preserve both global structure and local details.

DetailsMotivation: Existing time series imputation models perform well in training but produce poor imputations and distorted latent representations during inference under high missing rates, revealing a critical optimization dilemma where current objectives lack global guidance.

Method: Extends the standard Information Bottleneck framework by introducing a Global Alignment loss derived from tractable mutual information approximation, which aligns latent representations of masked inputs with their originally observed counterparts.

Result: Extensive experiments on nine datasets confirm consistently improved performance and aligned latent representations under missingness, with better generalization under high missing rates.

Conclusion: Glocal-IB effectively addresses the optimization dilemma in time series imputation by helping models retain global structure and local details while suppressing noise from missing values, leading to improved generalization.

Abstract: Time Series Imputation (TSI), which aims to recover missing values in temporal data, remains a fundamental challenge due to the complex and often high-rate missingness in real-world scenarios. Existing models typically optimize the point-wise reconstruction loss, focusing on recovering numerical values (local information). However, we observe that under high missing rates, these models still perform well in the training phase yet produce poor imputations and distorted latent representation distributions (global information) in the inference phase. This reveals a critical optimization dilemma: current objectives lack global guidance, leading models to overfit local noise and fail to capture global information of the data. To address this issue, we propose a new training paradigm, Glocal Information Bottleneck (Glocal-IB). Glocal-IB is model-agnostic and extends the standard IB framework by introducing a Global Alignment loss, derived from a tractable mutual information approximation. This loss aligns the latent representations of masked inputs with those of their originally observed counterparts. It helps the model retain global structure and local details while suppressing noise caused by missing values, giving rise to better generalization under high missingness. Extensive experiments on nine datasets confirm that Glocal-IB leads to consistently improved performance and aligned latent representations under missingness. Our code implementation is available in https://github.com/Muyiiiii/NeurIPS-25-Glocal-IB.

[924] Flow-Matching Based Refiner for Molecular Conformer Generation

Xiangyang Xu, Hongyang Gao

Main category: cs.LG

TL;DR: Proposes a flow-matching refiner for molecular conformer generation that improves sample quality by initializing from mixed-quality outputs and bypassing low-SNR phases.

DetailsMotivation: Existing denoising-based methods for molecular conformer generation suffer from error accumulation during sampling, particularly in low-SNR steps that are difficult to train.

Method: Uses a flow-matching refiner that initializes sampling from mixed-quality outputs of upstream denoising models and reschedules noise scale to avoid low-SNR phases.

Result: On GEOM-QM9 and GEOM-Drugs datasets, the generator-refiner pipeline improves quality with fewer total denoising steps while maintaining diversity.

Conclusion: The proposed flow-matching refiner effectively addresses error accumulation in molecular conformer generation, achieving better quality with reduced computational cost.

Abstract: Low-energy molecular conformers generation (MCG) is a foundational yet challenging problem in drug discovery. Denoising-based methods include diffusion and flow-matching methods that learn mappings from a simple base distribution to the molecular conformer distribution. However, these approaches often suffer from error accumulation during sampling, especially in the low SNR steps, which are hard to train. To address these challenges, we propose a flow-matching refiner for the MCG task. The proposed method initializes sampling from mixed-quality outputs produced by upstream denoising models and reschedules the noise scale to bypass the low-SNR phase, thereby improving sample quality. On the GEOM-QM9 and GEOM-Drugs benchmark datasets, the generator-refiner pipeline improves quality with fewer total denoising steps while preserving diversity.

[925] Federated Self-Supervised Learning for Automatic Modulation Classification under Non-IID and Class-Imbalanced Data

Usman Akram, Yiyue Chen, Haris Vikalo

Main category: cs.LG

TL;DR: FedSSL-AMC uses federated self-supervised learning with triplet-loss CNN on unlabeled I/Q data, followed by per-client SVMs on small labeled sets, achieving better performance than supervised FL under various channel conditions.

DetailsMotivation: Centralized AMC training raises privacy concerns, communication overhead, and lacks robustness to channel shifts. Standard FL is sensitive to class imbalance, non-IID distributions, and limited labeled samples.

Method: Train causal time-dilated CNN with triplet-loss self-supervision on unlabeled I/Q sequences across clients, then use per-client SVMs on small labeled sets. Provides convergence guarantees for federated representation learning.

Result: Experiments on synthetic and over-the-air datasets show consistent gains over supervised FL baselines under heterogeneous SNR, carrier-frequency offsets, and non-IID label partitions.

Conclusion: FedSSL-AMC effectively addresses privacy, communication, and robustness issues in AMC by combining federated self-supervised learning with efficient downstream classification.

Abstract: Training automatic modulation classification (AMC) models on centrally aggregated data raises privacy concerns, incurs communication overhead, and often fails to confer robustness to channel shifts. Federated learning (FL) avoids central aggregation by training on distributed clients but remains sensitive to class imbalance, non-IID client distributions, and limited labeled samples. We propose FedSSL-AMC, which trains a causal, time-dilated CNN with triplet-loss self-supervision on unlabeled I/Q sequences across clients, followed by per-client SVMs on small labeled sets. We establish convergence of the federated representation learning procedure and a separability guarantee for the downstream classifier under feature noise. Experiments on synthetic and over-the-air datasets show consistent gains over supervised FL baselines under heterogeneous SNR, carrier-frequency offsets, and non-IID label partitions.

[926] Benchmarking M-LTSF: Frequency and Noise-Based Evaluation of Multivariate Long Time Series Forecasting Models

Nick Janßen, Melanie Schaller, Bodo Rosenhahn

Main category: cs.LG

TL;DR: Proposes a simulation-based evaluation framework using parameterizable synthetic datasets to systematically assess M-LTSF model robustness under controlled noise and signal conditions, revealing model-specific strengths and vulnerabilities.

DetailsMotivation: Current evaluations of multivariate long-term time series forecasting models rely on real-world datasets with unknown noise properties, making it challenging to understand model robustness systematically.

Method: Developed a simulation-based framework that generates synthetic datasets with configurable signal components, noise types, signal-to-noise ratios, and frequency characteristics to enable controlled evaluation of four M-LTSF architectures.

Result: All models degrade severely when lookback windows miss complete seasonal periods. S-Mamba and Autoformer excel on sawtooth patterns, while R-Linear and iTransformer prefer sinusoidal signals. S-Mamba shows trend-noise vulnerability, iTransformer shows seasonal-noise vulnerability, and both achieve superior frequency reconstruction.

Conclusion: The synthetic testbed provides deeper insights into model-specific strengths and limitations, offering concrete guidance for model selection based on signal characteristics and noise conditions through systematic evaluation.

Abstract: Understanding the robustness of deep learning models for multivariate long-term time series forecasting (M-LTSF) remains challenging, as evaluations typically rely on real-world datasets with unknown noise properties. We propose a simulation-based evaluation framework that generates parameterizable synthetic datasets, where each dataset instance corresponds to a different configuration of signal components, noise types, signal-to-noise ratios, and frequency characteristics. These configurable components aim to model real-world multivariate time series data without the ambiguity of unknown noise. This framework enables fine-grained, systematic evaluation of M-LTSF models under controlled and diverse scenarios. We benchmark four representative architectures S-Mamba (state-space), iTransformer (transformer-based), R-Linear (linear), and Autoformer (decomposition-based). Our analysis reveals that all models degrade severely when lookback windows cannot capture complete periods of seasonal patters in the data. S-Mamba and Autoformer perform best on sawtooth patterns, while R-Linear and iTransformer favor sinusoidal signals. White and Brownian noise universally degrade performance with lower signal-to-noise ratio while S-Mamba shows specific trend-noise and iTransformer shows seasonal-noise vulnerability. Further spectral analysis shows that S-Mamba and iTransformer achieve superior frequency reconstruction. This controlled approach, based on our synthetic and principle-driven testbed, offers deeper insights into model-specific strengths and limitations through the aggregation of MSE scores and provides concrete guidance for model selection based on signal characteristics and noise conditions.

[927] Feasibility-Aware Decision-Focused Learning for Predicting Parameters in the Constraints

Jayanta Mandi, Marianne Defresne, Senne Berden, Tias Guns

Main category: cs.LG

TL;DR: A decision-focused learning framework for predicting constraint parameters in constrained optimization problems with uncertainty, featuring two novel loss functions to balance feasibility and decision quality.

DetailsMotivation: When parameters in constraints are uncertain, predicted parameters can lead to infeasible solutions, requiring simultaneous management of both feasibility and decision quality in predict-then-optimize problems.

Method: Developed a DFL framework with two novel MLE-based loss functions: one penalizes infeasibility and another penalizes suboptimal decisions, combined with a tunable parameter to balance the trade-off.

Result: Experimental results show that adjusting the tunable parameter provides control over the trade-off between suboptimality and feasibility, and for a single parameter value, the method matches existing baselines on both metrics.

Conclusion: The proposed framework effectively balances feasibility and decision quality in constrained optimization problems with uncertain parameters, offering decision-makers flexible control over this trade-off.

Abstract: When some parameters of a constrained optimization problem (COP) are uncertain, this gives rise to a predict-then-optimize (PtO) problem, comprising two stages – the prediction of the unknown parameters from contextual information and the subsequent optimization using those predicted parameters. Decision-focused learning (DFL) implements the first stage by training a machine learning (ML) model to optimize the quality of the decisions made using the predicted parameters. When parameters in the constraints of a COP are predicted, the predicted parameters can lead to infeasible solutions. Therefore, it is important to simultaneously manage both feasibility and decision quality. We develop a DFL framework for predicting constraint parameters in a generic COP. While prior works typically assume that the underlying optimization problem is a linear program (LP) or integer linear program (ILP), our approach makes no such assumption. We derive two novel loss functions based on maximum likelihood estimation (MLE): the first one penalizes infeasibility (by penalizing when the predicted parameters lead to infeasible solutions), and the second one penalizes suboptimal decisions (by penalizing when the true optimal solution is infeasible under the predicted parameters). We introduce a single tunable parameter to form a weighted average of the two losses, allowing decision-makers to balance suboptimality and feasibility. We experimentally demonstrate that adjusting this parameter provides a decision-maker the control over the trade-off between the two. Moreover, across several COP instances, we find that for a single value of the tunable parameter, our method matches the performance of the existing baselines on suboptimality and feasibility.

[928] DP-HYPE: Distributed Differentially Private Hyperparameter Search

Johannes Liebenow, Thorsten Peinemann, Esfandiar Mohammadi

Main category: cs.LG

TL;DR: DP-HYPE is a distributed, privacy-preserving hyperparameter tuning algorithm that uses local client evaluations and voting to find consensus hyperparameters while maintaining client-level differential privacy.

DetailsMotivation: Existing differentially private hyperparameter tuning methods are either computationally expensive, client-specific, or have poor utility-privacy trade-offs. There's a need for scalable distributed hyperparameter tuning that preserves privacy.

Method: DP-HYPE performs distributed voting based on local hyperparameter evaluations from clients to select hyperparameters supported by the majority, ensuring client-level differential privacy without dependency on hyperparameter count.

Result: The algorithm maintains high utility even under small privacy budgets, works in both iid and non-iid settings, and is implemented in the Flower framework for distributed ML.

Conclusion: DP-HYPE provides an efficient, scalable solution for distributed hyperparameter tuning with strong privacy guarantees and good performance across various data distributions.

Abstract: The tuning of hyperparameters in distributed machine learning can substantially impact model performance. When the hyperparameters are tuned on sensitive data, privacy becomes an important challenge and to this end, differential privacy has emerged as the de facto standard for provable privacy. A standard setting when performing distributed learning tasks is that clients agree on a shared setup, i.e., find a compromise from a set of hyperparameters, like the learning rate of the model to be trained. Yet, prior work on differentially private hyperparameter tuning either uses computationally expensive cryptographic protocols, determines hyperparameters separately for each client, or applies differential privacy locally, which can lead to undesirable utility-privacy trade-offs. In this work, we present our algorithm DP-HYPE, which performs a distributed and privacy-preserving hyperparameter search by conducting a distributed voting based on local hyperparameter evaluations of clients. In this way, DP-HYPE selects hyperparameters that lead to a compromise supported by the majority of clients, while maintaining scalability and independence from specific learning tasks. We prove that DP-HYPE preserves the strong notion of differential privacy called client-level differential privacy and, importantly, show that its privacy guarantees do not depend on the number of hyperparameters. We also provide bounds on its utility guarantees, that is, the probability of reaching a compromise, and implement DP-HYPE as a submodule in the popular Flower framework for distributed machine learning. In addition, we evaluate performance on multiple benchmark data sets in iid as well as multiple non-iid settings and demonstrate high utility of DP-HYPE even under small privacy budgets.

[929] How Different from the Past? Spatio-Temporal Time Series Forecasting with Self-Supervised Deviation Learning

Haotian Gao, Zheng Dong, Jiawei Yong, Shintaro Fukushima, Kenjiro Taura, Renhe Jiang

Main category: cs.LG

TL;DR: ST-SSDL is a spatio-temporal forecasting framework that uses self-supervised deviation learning to capture dynamic deviations between current inputs and historical patterns, improving forecasting accuracy.

DetailsMotivation: Existing spatio-temporal forecasting methods often fail to account for dynamic deviations between current inputs and historical patterns, which contain critical signals affecting model performance.

Method: The framework anchors inputs to historical averages, discretizes latent space using learnable prototypes for typical patterns, and uses contrastive loss and deviation loss to refine the structure and quantify deviations.

Result: Experiments on six benchmark datasets show ST-SSDL consistently outperforms state-of-the-art baselines across multiple metrics, with visualizations demonstrating adaptive response to varying deviation levels.

Conclusion: ST-SSDL effectively captures and utilizes dynamic deviations through self-supervised learning, improving generalization and performance in complex spatio-temporal scenarios.

Abstract: Spatio-temporal forecasting is essential for real-world applications such as traffic management and urban computing. Although recent methods have shown improved accuracy, they often fail to account for dynamic deviations between current inputs and historical patterns. These deviations contain critical signals that can significantly affect model performance. To fill this gap, we propose ST-SSDL, a Spatio-Temporal time series forecasting framework that incorporates a Self-Supervised Deviation Learning scheme to capture and utilize such deviations. ST-SSDL anchors each input to its historical average and discretizes the latent space using learnable prototypes that represent typical spatio-temporal patterns. Two auxiliary objectives are proposed to refine this structure: a contrastive loss that enhances inter-prototype discriminability and a deviation loss that regularizes the distance consistency between input representations and corresponding prototypes to quantify deviation. Optimized jointly with the forecasting objective, these components guide the model to organize its hidden space and improve generalization across diverse input conditions. Experiments on six benchmark datasets show that ST-SSDL consistently outperforms state-of-the-art baselines across multiple metrics. Visualizations further demonstrate its ability to adaptively respond to varying levels of deviation in complex spatio-temporal scenarios. Our code and datasets are available at https://github.com/Jimmy-7664/ST-SSDL.

[930] Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

Ali Saheb Pasand, Elvis Dohmatob

Main category: cs.LG

TL;DR: Grokking is when test performance stagnates for many epochs then suddenly jumps to near-perfect levels. The paper shows grokking is caused by asymmetric gradient descent speeds along different directions, and proposes Egalitarian Gradient Descent (EGD) that normalizes gradients to equalize speeds, eliminating plateaus.

DetailsMotivation: To reduce the long plateaus in grokking phenomenon where test performance stagnates before sudden improvement, making learning 'grok' faster.

Method: Propose Egalitarian Gradient Descent (EGD) - a modified form of natural gradient descent that normalizes gradients so dynamics along all principal directions evolve at the same speed.

Result: EGD groks much faster, sometimes completely removing stagnation. Empirical results show elimination of plateaus on classical arithmetic problems like modular addition and sparse parity.

Conclusion: Grokking can be induced by asymmetric gradient descent speeds, and EGD effectively addresses this by equalizing learning speeds across all directions, significantly reducing or eliminating grokking plateaus.

Abstract: Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is to make the learning process “grok” faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent, along different principal (i.e singular directions) of the gradients. We then propose a simple modification that normalizes the gradients so that dynamics along all the principal directions evolves at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity problem which this stagnation has been widely observed and intensively studied, that our proposed method eliminates the plateaus.

[931] Rethinking Langevin Thompson Sampling from A Stochastic Approximation Perspective

Weixin Wang, Haoyang Zheng, Guang Lin, Wei Deng, Pan Xu

Main category: cs.LG

TL;DR: TS-SA introduces stochastic approximation into Thompson Sampling for multi-armed bandits, creating a stationary posterior approximation that eliminates round-specific tuning and enables fixed step-sizes with improved performance.

DetailsMotivation: Existing approximate Thompson Sampling methods require round-specific tuning of hyperparameters like dynamic learning rates due to non-stationary posterior distributions, making theoretical analysis and practical implementation challenging.

Method: TS-SA incorporates stochastic approximation within TS framework: constructs posterior approximation using recent rewards, performs Langevin Monte Carlo update, and applies SA step to average noisy proposals over time, approximating a stationary posterior target.

Result: Establishes near-optimal regret bounds with simplified theoretical analysis, and empirical results show single-step Langevin update with warm-up substantially outperforms existing methods on bandit tasks.

Conclusion: TS-SA provides a unified framework for approximate Thompson Sampling with stationary posterior approximation, fixed step-sizes, improved theoretical analysis, and superior empirical performance compared to existing methods.

Abstract: Most existing approximate Thompson Sampling (TS) algorithms for multi-armed bandits use Stochastic Gradient Langevin Dynamics (SGLD) or its variants in each round to sample from the posterior, relaxing the need for conjugacy assumptions between priors and reward distributions in vanilla TS. However, they often require approximating a different posterior distribution in different round of the bandit problem. This requires tricky, round-specific tuning of hyperparameters such as dynamic learning rates, causing challenges in both theoretical analysis and practical implementation. To alleviate this non-stationarity, we introduce TS-SA, which incorporates stochastic approximation (SA) within the TS framework. In each round, TS-SA constructs a posterior approximation only using the most recent reward(s), performs a Langevin Monte Carlo (LMC) update, and applies an SA step to average noisy proposals over time. This can be interpreted as approximating a stationary posterior target throughout the entire algorithm, which further yields a fixed step-size, a unified convergence analysis framework, and improved posterior estimates through temporal averaging. We establish near-optimal regret bounds for TS-SA, with a simplified and more intuitive theoretical analysis enabled by interpreting the entire algorithm as a simulation of a stationary SGLD process. Our empirical results demonstrate that even a single-step Langevin update with certain warm-up outperforms existing methods substantially on bandit tasks.

[932] StructuralDecompose: A Modular Framework for Robust Time Series Decomposition in R

Allen Daniel Sunny

Main category: cs.LG

TL;DR: StructuralDecompose is an R package for modular time series decomposition that separates the process into changepoint detection, anomaly detection, smoothing, and decomposition components.

DetailsMotivation: Existing approaches treat decomposition as a monolithic process, lacking flexibility and robustness for different time series characteristics.

Method: Separates time series analysis into distinct components: changepoint detection, anomaly detection, smoothing, and decomposition, allowing users to tailor methods to specific needs.

Result: Demonstrated on simulated and real-world datasets with benchmarking against state-of-the-art tools like Rbeast and autostsm.

Conclusion: The package provides flexibility, robustness, and plays a role in interpretable machine learning workflows.

Abstract: We present StructuralDecompose, an R package for modular and interpretable time series decomposition. Unlike existing approaches that treat decomposition as a monolithic process, StructuralDecompose separates the analysis into distinct components: changepoint detection, anomaly detection, smoothing, and decomposition. This design provides flexibility and robust- ness, allowing users to tailor methods to specific time series characteristics. We demonstrate the package on simulated and real-world datasets, benchmark its performance against state-of-the- art tools such as Rbeast and autostsm, and discuss its role in interpretable machine learning workflows.

[933] Graph-Aware Diffusion for Signal Generation

Sergio Rozada, Vimal K. B., Andrea Cavallo, Antonio G. Marques, Hadi Jamali-Rad, Elvin Isufi

Main category: cs.LG

TL;DR: A graph-aware generative diffusion model (GAD) for generating graph signals that incorporates graph structure through a modified heat equation with time-warped coefficients to prevent exponential decay.

DetailsMotivation: Existing methods for generating graph signals lack generality - they either ignore graph structure or use domain-specific mechanisms. There's a need for general graph-aware diffusion models for applications like recommender systems and sensor networks.

Method: Uses a forward process based on the heat equation with a time-warped coefficient to mitigate exponential decay. Analyzes forward dynamics converging to Gaussian Markov random fields with graph Laplacian covariance, and interprets backward dynamics as graph-signal denoising problems.

Result: Demonstrated advantages of GAD on synthetic data, real traffic speed measurements, and temperature sensor networks, showing improved performance over existing approaches.

Conclusion: GAD provides a general framework for graph-aware generative diffusion modeling that effectively incorporates graph structure while avoiding domain-specific limitations of previous methods.

Abstract: We study the problem of generating graph signals from unknown distributions defined over given graphs, relevant to domains such as recommender systems or sensor networks. Our approach builds on generative diffusion models, which are well established in vision and graph generation but remain underexplored for graph signals. Existing methods lack generality, either ignoring the graph structure in the forward process or designing graph-aware mechanisms tailored to specific domains. We adopt a forward process that incorporates the graph through the heat equation. Rather than relying on the standard formulation, we consider a time-warped coefficient to mitigate the exponential decay of the drift term, yielding a graph-aware generative diffusion model (GAD). We analyze its forward dynamics, proving convergence to a Gaussian Markov random field with covariance parametrized by the graph Laplacian, and interpret the backward dynamics as a sequence of graph-signal denoising problems. Finally, we demonstrate the advantages of GAD on synthetic data, real traffic speed measurements, and a temperature sensor network.

[934] Federated Computation of ROC and PR Curves

Xuefeng Xu, Graham Cormode

Main category: cs.LG

TL;DR: A novel method for approximating ROC and PR curves in federated learning using quantile estimation under distributed differential privacy, with theoretical bounds on area error and empirical validation.

DetailsMotivation: ROC and PR curves are essential for evaluating classifiers but cannot be computed directly in federated learning due to privacy and communication constraints, as the server cannot access raw prediction scores and labels.

Method: Estimate quantiles of the prediction score distribution under distributed differential privacy to approximate ROC and PR curves in a federated setting.

Result: Theoretical bounds on Area Error (AE) between true and estimated curves show trade-offs between accuracy, privacy, and communication. Empirical results demonstrate high approximation accuracy with minimal communication and strong privacy guarantees.

Conclusion: The proposed method is practical for privacy-preserving model evaluation in federated systems, achieving accurate curve approximations while maintaining privacy and efficiency.

Abstract: Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves are fundamental tools for evaluating machine learning classifiers, offering detailed insights into the trade-offs between true positive rate vs. false positive rate (ROC) or precision vs. recall (PR). However, in Federated Learning (FL) scenarios, where data is distributed across multiple clients, computing these curves is challenging due to privacy and communication constraints. Specifically, the server cannot access raw prediction scores and class labels, which are used to compute the ROC and PR curves in a centralized setting. In this paper, we propose a novel method for approximating ROC and PR curves in a federated setting by estimating quantiles of the prediction score distribution under distributed differential privacy. We provide theoretical bounds on the Area Error (AE) between the true and estimated curves, demonstrating the trade-offs between approximation accuracy, privacy, and communication cost. Empirical results on real-world datasets demonstrate that our method achieves high approximation accuracy with minimal communication and strong privacy guarantees, making it practical for privacy-preserving model evaluation in federated systems.

[935] Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts

Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi

Main category: cs.LG

TL;DR: HEX is a training-free inference method that ensembles across heterogeneous block schedules in diffusion-based LLMs, boosting reasoning accuracy by leveraging latent semi-autoregressive experts without additional training.

DetailsMotivation: Diffusion-based LLMs learn implicit mixture of semi-autoregressive experts, but fixed inference schedules fail to leverage this latent ensemble, collapsing performance.

Method: HEX performs majority vote over diverse block-sized generation paths, ensembling across heterogeneous block schedules without training.

Result: Boosts GSM8K accuracy from 24.72% to 88.10% (3.56X), MATH from 16.40% to 40.00%, ARC-C from 54.18% to 87.80%, and TruthfulQA from 28.36% to 57.46%.

Conclusion: Sequence of masking plays critical role in inference performance; HEX establishes new paradigm for test-time scaling in dLLMs by leveraging latent ensemble of generation orders.

Abstract: Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semiautoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By doing a majority vote over diverse block-sized generation paths, HEX robustly avoids failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX even yields significant gains on MATH benchmark from 16.40% to 40.00%, scientific reasoning on ARC-C from 54.18% to 87.80%, and TruthfulQA from 28.36% to 57.46%. Our results establish a new paradigm for test-time scaling in diffusion-based LLMs (dLLMs), revealing that the sequence in which masking is performed plays a critical role in determining performance during inference.

[936] Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization

Kristi Topollai, Anna Choromanska

Main category: cs.LG

TL;DR: The paper introduces an adaptive memory mechanism that replaces constant momentum coefficients in optimizers with dynamic momentum adjusted online during training, outperforming standard methods like SGD and AdamW.

DetailsMotivation: Current momentum-based optimizers use constant momentum coefficients (typically β=0.9) throughout training, which is suboptimal. The authors aim to develop a more adaptive approach that dynamically adjusts momentum during optimization.

Method: Proposes an adaptive memory mechanism using a proximal framework that approximates the objective function with two planes: one from current gradient and another from accumulated past gradients. This allows dynamic adjustment of momentum coefficients online without extra hyperparameter tuning.

Result: The adaptive memory variants of SGD and AdamW outperformed standard methods across various learning tasks, from convex problems to large-scale deep learning scenarios, demonstrating the effectiveness of the approach.

Conclusion: The proposed adaptive memory mechanism is novel, simple to use, and effective, opening new possibilities for introducing adaptivity in optimization algorithms beyond constant momentum coefficients.

Abstract: The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer’s memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent optimizers such as AdamW and Lion, all rely on the momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model training, a strategy widely used by practitioners, yet suboptimal. In this paper, we introduce an \textit{adaptive memory} mechanism that replaces constant momentum with a dynamic momentum coefficient that is adjusted online during optimization. We derive our method by approximating the objective function using two planes: one derived from the gradient at the current iterate and the other obtained from the accumulated memory of the past gradients. To the best of our knowledge, such a proximal framework was never used for momentum-based optimization. Our proposed approach is novel, extremely simple to use, and does not rely on extra assumptions or hyperparameter tuning. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks, from simple convex problems to large-scale deep learning scenarios, demonstrating that our approach can outperform standard SGD and Adam with hand-tuned momentum coefficients. Finally, our work opens doors for new ways of inducing adaptivity in optimization.

[937] HybridFlow: Quantification of Aleatoric and Epistemic Uncertainty with a Single Hybrid Model

Peter Van Katwyk, Karianne J. Bergen

Main category: cs.LG

TL;DR: HybridFlow is a modular hybrid architecture that unifies aleatoric and epistemic uncertainty modeling using Conditional Masked Autoregressive normalizing flow and flexible probabilistic predictors, improving uncertainty quantification across various regression tasks.

DetailsMotivation: Uncertainty quantification is critical for robustness in high-stakes machine learning applications, addressing the challenge of unifying aleatoric and epistemic uncertainty modeling in Bayesian deep learning.

Method: Combines Conditional Masked Autoregressive normalizing flow for aleatoric uncertainty with flexible probabilistic predictor for epistemic uncertainty, supporting integration with any probabilistic model class.

Result: Improves upon previous uncertainty quantification frameworks across regression tasks including depth estimation, regression benchmarks, and ice sheet emulation. Shows calibrated uncertainty that better aligns with model error than existing methods.

Conclusion: HybridFlow successfully unifies aleatoric and epistemic uncertainty modeling in a single robust framework, addressing a key challenge in Bayesian deep learning.

Abstract: Uncertainty quantification is critical for ensuring robustness in high-stakes machine learning applications. We introduce HybridFlow, a modular hybrid architecture that unifies the modeling of aleatoric and epistemic uncertainty by combining a Conditional Masked Autoregressive normalizing flow for estimating aleatoric uncertainty with a flexible probabilistic predictor for epistemic uncertainty. The framework supports integration with any probabilistic model class, allowing users to easily adapt HybridFlow to existing architectures without sacrificing predictive performance. HybridFlow improves upon previous uncertainty quantification frameworks across a range of regression tasks, such as depth estimation, a collection of regression benchmarks, and a scientific case study of ice sheet emulation. We also provide empirical results of the quantified uncertainty, showing that the uncertainty quantified by HybridFlow is calibrated and better aligns with model error than existing methods for quantifying aleatoric and epistemic uncertainty. HybridFlow addresses a key challenge in Bayesian deep learning, unifying aleatoric and epistemic uncertainty modeling in a single robust framework.

[938] Power Transform Revisited: Numerically Stable, and Federated

Xuefeng Xu, Graham Cormode

Main category: cs.LG

TL;DR: The paper analyzes numerical instabilities in power transforms and proposes remedies, extending them to federated learning with improved stability.

DetailsMotivation: Power transforms are widely used for making data Gaussian-like but suffer from severe numerical instabilities that can lead to incorrect results or crashes.

Method: Comprehensive analysis of instability sources and development of effective remedies, plus extension of power transforms to federated learning addressing numerical and distributional challenges.

Result: Experiments on real-world datasets show the methods are effective and robust, substantially improving stability compared to existing approaches.

Conclusion: The proposed remedies successfully address numerical instabilities in power transforms and enable their reliable use in federated learning settings.

Abstract: Power transforms are popular parametric techniques for making data more Gaussian-like, and are widely used as preprocessing steps in statistical analysis and machine learning. However, we find that direct implementations of power transforms suffer from severe numerical instabilities, which can lead to incorrect results or even crashes. In this paper, we provide a comprehensive analysis of the sources of these instabilities and propose effective remedies. We further extend power transforms to the federated learning setting, addressing both numerical and distributional challenges that arise in this context. Experiments on real-world datasets demonstrate that our methods are both effective and robust, substantially improving stability compared to existing approaches.

[939] TopInG: Topologically Interpretable Graph Learning via Persistent Rationale Filtration

Cheng Xin, Fan Xu, Xin Ding, Jie Gao, Jiaxin Ding

Main category: cs.LG

TL;DR: TopInG is a novel topological framework that uses persistent homology to identify persistent rationale subgraphs in graphs, improving both predictive accuracy and interpretability of Graph Neural Networks.

DetailsMotivation: GNNs lack interpretability which hinders their adoption in critical decision-making, and existing interpretable GNN methods struggle with complex and varied rationale subgraphs.

Method: TopInG employs rationale filtration learning to model an autoregressive generation process of rationale subgraphs and introduces topological discrepancy constraint to enforce persistent topological distinction between rationale and irrelevant subgraphs.

Result: Extensive experiments show TopInG improves upon state-of-the-art methods on both predictive accuracy and interpretation quality, handling variform rationale subgraphs and mitigating spurious correlations.

Conclusion: The proposed topological framework provides theoretical guarantees and practical effectiveness for interpretable graph learning, balancing predictive performance with interpretability.

Abstract: Graph Neural Networks (GNNs) have shown remarkable success across various scientific fields, yet their adoption in critical decision-making is often hindered by a lack of interpretability. Recently, intrinsically interpretable GNNs have been studied to provide insights into model predictions by identifying rationale substructures in graphs. However, existing methods face challenges when the underlying rationale subgraphs are complex and varied. In this work, we propose TopInG: Topologically Interpretable Graph Learning, a novel topological framework that leverages persistent homology to identify persistent rationale subgraphs. TopInG employs a rationale filtration learning approach to model an autoregressive generation process of rationale subgraphs, and introduces a self-adjusted topological constraint, termed topological discrepancy, to enforce a persistent topological distinction between rationale subgraphs and irrelevant counterparts. We provide theoretical guarantees that our loss function is uniquely optimized by the ground truth under specific conditions. Extensive experiments demonstrate TopInG’s effectiveness in tackling key challenges, such as handling variform rationale subgraphs, balancing predictive performance with interpretability, and mitigating spurious correlations. Results show that our approach improves upon state-of-the-art methods on both predictive accuracy and interpretation quality.

[940] Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks

Main category: cs.LG

TL;DR: Inoculation Prompting (IP) is a technique that prevents learning of undesired behaviors in LLMs by explicitly requesting those behaviors during training, reducing reward hacking and sycophancy without compromising desired capabilities.

DetailsMotivation: Large language models are often trained with imperfect oversight signals, leading to problematic behaviors like reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, so methods are needed to improve behavior despite imperfect training signals.

Method: Inoculation Prompting modifies training prompts to explicitly request undesired behaviors. For example, to prevent reward hacking, prompts request code that only works on provided test cases but fails on other inputs. Prompts that more strongly elicit undesired behavior before fine-tuning are more effective inoculation prompts.

Result: Across four settings, IP reduces learning of undesired behavior without substantially reducing learning of desired capabilities. The technique effectively controls how models generalize from fine-tuning.

Conclusion: IP is a simple yet effective way to prevent learning of undesired behaviors in LLMs without substantially disrupting desired capabilities, providing a practical approach to address imperfect training signals.

Abstract: Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.

[941] KEEP: Integrating Medical Ontologies with Clinical Data for Robust Code Embeddings

Ahmed Elhussein, Paul Meddeb, Abigail Newbury, Jeanne Mirone, Martin Stoll, Gamze Gursoy

Main category: cs.LG

TL;DR: KEEP is a framework that combines knowledge graph embeddings with adaptive learning from clinical data to create medical code representations that preserve ontological relationships while capturing empirical patterns, outperforming traditional and LM-based methods.

DetailsMotivation: Current methods for medical code representation face a trade-off: knowledge graph approaches capture formal relationships but miss real-world patterns, while data-driven methods learn empirical associations but overlook structured knowledge in medical terminologies.

Method: KEEP first generates embeddings from knowledge graphs, then employs regularized training on patient records to adaptively integrate empirical patterns while preserving ontological relationships. It produces final embeddings without task-specific auxiliary or end-to-end training.

Result: Evaluations on structured EHR from UK Biobank and MIMIC-IV show KEEP outperforms both traditional and Language Model-based approaches in capturing semantic relationships and predicting clinical outcomes.

Conclusion: KEEP bridges the gap between knowledge graph and data-driven approaches, providing efficient medical code representations suitable for multiple downstream applications with minimal computational requirements, making it particularly suitable for resource-constrained environments.

Abstract: Machine learning in healthcare requires effective representation of structured medical codes, but current methods face a trade off: knowledge graph based approaches capture formal relationships but miss real world patterns, while data driven methods learn empirical associations but often overlook structured knowledge in medical terminologies. We present KEEP (Knowledge preserving and Empirically refined Embedding Process), an efficient framework that bridges this gap by combining knowledge graph embeddings with adaptive learning from clinical data. KEEP first generates embeddings from knowledge graphs, then employs regularized training on patient records to adaptively integrate empirical patterns while preserving ontological relationships. Importantly, KEEP produces final embeddings without task specific auxiliary or end to end training enabling KEEP to support multiple downstream applications and model architectures. Evaluations on structured EHR from UK Biobank and MIMIC IV demonstrate that KEEP outperforms both traditional and Language Model based approaches in capturing semantic relationships and predicting clinical outcomes. Moreover, KEEP’s minimal computational requirements make it particularly suitable for resource constrained environments.

[942] Modeling Student Learning with 3.8 Million Program Traces

Alexis Ross, Megha Srivastava, Jeremiah Blanchard, Jacob Andreas

Main category: cs.LG

TL;DR: Training language models on real programming interaction traces from students reveals insights about coders’ behavior and enables better modeling of diverse student programming patterns, including helping students recover from mistakes while maintaining their coding style.

DetailsMotivation: To understand what can be learned from programming interaction traces about how students learn to code, including their reasoning processes, exploratory behavior, and skill development patterns.

Method: Introduced a dataset of 3.8M programming reasoning traces from Pencil Code platform users, trained language models on these real traces, and compared with models trained only on final programs or synthetic traces.

Result: Models trained on real traces better model diverse student behavior, can predict properties like goal backtracking and comments from student representations, and can help students recover from mistakes while preserving their coding style.

Conclusion: Many code properties are actually properties of individual students, and training on edit traces produces more steerable models that better predict student behavior and generate programs in final states.

Abstract: As programmers write code, they often edit and retry multiple times, creating rich “interaction traces” that reveal how they approach coding tasks and provide clues about their level of skill development. For novice programmers in particular, these traces reflect the diverse reasoning processes they employ to code, such as exploratory behavior to understand how a programming concept works, re-strategizing in response to bugs, and personalizing stylistic choices. In this work, we explore what can be learned from training language models on such reasoning traces: not just about code, but about coders, and particularly students learning to program. We introduce a dataset of over 3.8 million programming reasoning traces from users of Pencil Code, a free online educational platform used by students to learn simple programming concepts. Compared to models trained only on final programs or synthetically-generated traces, we find that models trained on real traces are stronger at modeling diverse student behavior. Through both behavioral and probing analyses, we also find that many properties of code traces, such as goal backtracking or number of comments, can be predicted from learned representations of the students who write them. Building on this result, we show that we can help students recover from mistakes by steering code generation models to identify a sequence of edits that will results in more correct code while remaining close to the original student’s style. Together, our results suggest that many properties of code are properties of individual students and that training on edit traces can lead to models that are more steerable, more predictive of student behavior while programming, and better at generating programs in their final states. Code and data is available at https://github.com/meghabyte/pencilcode-public

[943] ResCP: Reservoir Conformal Prediction for Time Series Forecasting

Roberto Neglia, Andrea Cini, Michael M. Bronstein, Filippo Maria Bianchi

Main category: cs.LG

TL;DR: Reservoir Conformal Prediction (ResCP) is a training-free conformal prediction method for time series that uses reservoir computing to dynamically reweight conformity scores, achieving asymptotic conditional coverage without complex model training.

DetailsMotivation: Existing conformal prediction methods for sequential data require complex models that fail with small sample sizes and need expensive retraining when data distributions change.

Method: Leverages reservoir computing to compute similarity scores among reservoir states and adaptively reweight observed residuals at each step, accounting for local temporal dynamics without training.

Result: Achieves asymptotic conditional coverage and demonstrates effectiveness across diverse forecasting tasks while maintaining computational scalability.

Conclusion: ResCP provides an efficient, training-free alternative to existing conformal prediction methods for time series that handles small sample sizes and distribution changes effectively.

Abstract: Conformal prediction offers a powerful framework for building distribution-free prediction intervals for exchangeable data. Existing methods that extend conformal prediction to sequential data rely on fitting a relatively complex model to capture temporal dependencies. However, these methods can fail if the sample size is small and often require expensive retraining when the underlying data distribution changes. To overcome these limitations, we propose Reservoir Conformal Prediction (ResCP), a novel training-free conformal prediction method for time series. Our approach leverages the efficiency and representation learning capabilities of reservoir computing to dynamically reweight conformity scores. In particular, we compute similarity scores among reservoir states and use them to adaptively reweight the observed residuals at each step. With this approach, ResCP enables us to account for local temporal dynamics when modeling the error distribution without compromising computational scalability. We prove that, under reasonable assumptions, ResCP achieves asymptotic conditional coverage, and we empirically demonstrate its effectiveness across diverse forecasting tasks.

[944] Boomerang Distillation Enables Zero-Shot Model Size Interpolation

Sara Kangaslahti, Nihal V. Nayak, Jonathan Geuter, Marco Fumero, Francesco Locatello, David Alvarez-Melis

Main category: cs.LG

TL;DR: Boomerang distillation enables creating fine-grained model families by distilling a large teacher model down to a small student, then reconstructing intermediate-sized models by re-incorporating teacher blocks without additional training.

DetailsMotivation: Existing approaches for building model families require training each size independently, which is expensive and provides only coarse-grained size options.

Method: Start with a large teacher model, distill down to a small student, then progressively reconstruct intermediate-sized models by re-incorporating blocks of teacher layers into the student without additional training.

Result: Produces zero-shot interpolated models whose performance scales smoothly between student and teacher, often matching or surpassing pretrained or distilled models of the same size.

Conclusion: Boomerang distillation provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments.

Abstract: Large language models (LLMs) are typically deployed under diverse memory and compute constraints. Existing approaches build model families by training each size independently, which is prohibitively expensive and provides only coarse-grained size options. In this work, we identify a novel phenomenon that we call boomerang distillation: starting from a large base model (the teacher), one first distills down to a small student and then progressively reconstructs intermediate-sized models by re-incorporating blocks of teacher layers into the student without any additional training. This process produces zero-shot interpolated models of many intermediate sizes whose performance scales smoothly between the student and teacher, often matching or surpassing pretrained or distilled models of the same size. We further analyze when this type of interpolation succeeds, showing that alignment between teacher and student through pruning and distillation is essential. Boomerang distillation thus provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments. The code and models are available at https://github.com/dcml-lab/boomerang-distillation.

[945] MICROTRIPS: MICRO-geography TRavel Intelligence and Pattern Synthesis

Yangyang Wang, Tayo Fabusuyi

Main category: cs.LG

TL;DR: Novel small-area estimation framework using microdata and machine learning to improve urban transportation planning with high-resolution travel behavior predictions.

DetailsMotivation: To enhance urban transportation planning by providing detailed characterization of travel behavior at small geographic areas, improving on traditional four-step travel models.

Method: Uses publicly available microdata files and machine learning methods to predict travel behavior for synthetic populations at small geographic areas, enabling high-resolution estimation of trip generation, distribution, mode choice, and route assignment.

Result: Validation with ACS/PUMS work-commute datasets shows higher accuracy compared to conventional approaches, providing granular insights for localized interventions.

Conclusion: The framework enables tailored interventions for localized situations and supports various policy applications including optimal placement of micro-fulfillment centers, curb-space management, and inclusive transportation solutions for vulnerable communities.

Abstract: This study presents a novel small-area estimation framework to enhance urban transportation planning through detailed characterization of travel behavior. Our approach improves on the four-step travel model by employing publicly available microdata files and machine learning methods to predict travel behavior for a representative, synthetic population at small geographic areas. This approach enables high-resolution estimation of trip generation, trip distribution, mode choice, and route assignment. Validation using ACS/PUMS work-commute datasets demonstrates that our framework achieves higher accuracy compared to conventional approaches. The resulting granular insights enable the tailoring of interventions to address localized situations and support a range of policy applications and targeted interventions, including the optimal placement of micro-fulfillment centers, effective curb-space management, and the design of more inclusive transportation solutions particularly for vulnerable communities.

[946] Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma

Main category: cs.LG

TL;DR: Fed-SB introduces a federated fine-tuning method for LLMs using LoRA-SB that achieves optimal performance with significantly reduced communication costs by learning a small square matrix between adapters, enabling exact updates through direct averaging.

DetailsMotivation: Traditional federated fine-tuning using LoRA suffers from suboptimal updates due to federated averaging of individual adapters, leading to either high communication costs that scale with client numbers or performance degradation from limited expressivity.

Method: Uses LoRA-SB which learns a small square matrix (R) between adapters B and A while keeping other components fixed, enabling direct averaging of R that guarantees exact updates and reduces communication costs independent of client numbers.

Result: Achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x, with additional benefits in private settings through reduced trainable parameters and avoided noise amplification.

Conclusion: Fed-SB provides a state-of-the-art, efficient, and scalable solution for both private and non-private federated fine-tuning of foundation models, overcoming limitations of existing methods.

Abstract: Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB offers a state-of-the-art, efficient, and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at: https://github.com/CERT-Lab/fed-sb.

[947] DISC: Dynamic Decomposition Improves LLM Inference Scaling

Jonathan Light, Wei Cheng, Benjamin Riviere, Wu Yue, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen

Main category: cs.LG

TL;DR: Dynamic decomposition adaptively partitions reasoning traces into steps during LLM inference, improving efficiency by better allocating compute to challenging steps.

DetailsMotivation: Current inference scaling methods use predetermined step sizes that may not optimally allocate compute, especially for complex problems where some steps require more processing than others.

Method: Proposes dynamic decomposition that automatically and adaptively partitions solution and reasoning traces into manageable steps during inference, subdividing challenging steps and prioritizing their sampling.

Result: Outperforms static approaches (token-level, sentence-level, single-step) on APPS, MATH, and LiveCodeBench benchmarks, reducing pass@10 error rate by 5.0%, 6.7%, and 10.5% respectively.

Conclusion: Dynamic decomposition significantly improves inference efficiency and has potential to enhance various inference scaling techniques by better adapting to problem complexity.

Abstract: Inference scaling methods for LLMs often rely on decomposing problems into steps (or groups of tokens), followed by sampling and selecting the best next steps. However, these steps and their sizes are often predetermined or manually designed based on domain knowledge. We propose dynamic decomposition, a method that adaptively and automatically partitions solution and reasoning traces into manageable steps during inference. By more effectively allocating compute – particularly through subdividing challenging steps and prioritizing their sampling – dynamic decomposition significantly improves inference efficiency. Experiments on benchmarks such as APPS, MATH, and LiveCodeBench demonstrate that dynamic decomposition outperforms static approaches, including token-level, sentence-level, and single-step decompositions, reducing the pass@10 error rate by 5.0%, 6.7%, and 10.5% respectively. These findings highlight the potential of dynamic decomposition to improve a wide range of inference scaling techniques.

[948] Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Main category: cs.LG

TL;DR: The paper critically analyzes R1-Zero-like training, revealing pretraining biases in base models and optimization bias in GRPO, then introduces Dr. GRPO to address these issues and achieves state-of-the-art results with a minimalist recipe.

DetailsMotivation: To understand how base model pretraining characteristics influence RL performance and identify biases in current R1-Zero training approaches.

Method: Analyzed various base models, identified optimization bias in GRPO, and introduced Dr. GRPO as an unbiased optimization method that improves token efficiency.

Result: Achieved 43.3% accuracy on AIME 2024 with a 7B base model, establishing new state-of-the-art performance.

Conclusion: The minimalist R1-Zero recipe with Dr. GRPO optimization effectively addresses identified biases and achieves superior reasoning performance.

Abstract: DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibit ‘‘Aha moment’’, while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.

[949] TRA: Better Length Generalisation with Threshold Relative Attention

Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao

Main category: cs.LG

TL;DR: Transformers struggle with length generalization due to self-attention failures: inability to remove irrelevant information and positional biases up-weighting irrelevant keys. Proposed mitigations include selective sparsity and contextualized relative distance.

DetailsMotivation: Transformers show poor performance on basic tasks when generalizing to longer sequences, which may be explained by limitations in the self-attention mechanism.

Method: Refactored attention mechanism with two mitigations: selective sparsity (completely removing irrelevant keys from attention softmax) and contextualized relative distance (only considering distance between query and relevant keys).

Result: The proposed approach substantially improves generalization capabilities of decoder-only transformers.

Conclusion: The combination of selective sparsity and contextualized relative distance effectively addresses key failure cases in self-attention, leading to better length generalization in transformers.

Abstract: Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position, even if the dot product between a key and query is highly negative (i.e. an irrelevant key) learned positional biases may unintentionally up-weight such information - dangerous when distances become out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of a) selective sparsity - completely removing irrelevant keys from the attention softmax and b) contextualised relative distance - distance is only considered as between the query and the keys that matter. We show how refactoring the attention mechanism with these two mitigations in place can substantially improve the generalisation capabilities of decoder only transformers.

[950] AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy

Main category: cs.LG

TL;DR: AgentRewardBench is the first benchmark to evaluate LLM judges for assessing web agent trajectories, revealing that no single LLM excels across all benchmarks and that rule-based evaluation underreports success rates.

DetailsMotivation: Current evaluation methods for web agents have limitations - rule-based approaches are hard to extend to new tasks and may miss successful trajectories, while human evaluation is slow and expensive. LLM-based automatic evaluation could provide faster, cost-effective alternatives but their effectiveness is unknown.

Method: Created AgentRewardBench with 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory was reviewed by experts who assessed success, side effects, and repetitiveness. Used this benchmark to evaluate 12 LLM judges.

Result: No single LLM performed well across all benchmarks. Rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, showing a key weakness of current evaluation methods.

Conclusion: There is a need to develop more flexible automatic evaluations for web agents, as current rule-based methods are insufficient and LLM judges show inconsistent performance across different benchmarks.

Abstract: Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agents trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io

[951] Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study

Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma

Main category: cs.LG

TL;DR: Safety alignment in LLMs is highly entangled with general learning components rather than residing in distinct subspaces, making subspace-based defenses fundamentally limited.

DetailsMotivation: To investigate whether safety alignment corresponds to identifiable directions in weight space that could be isolated to defend against misalignment during fine-tuning.

Method: Comprehensive empirical study examining safety-relevant behavior in both weight and activation spaces across five open-source LLMs from Llama and Qwen families.

Result: Subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations - safety is highly entangled with general learning components.

Conclusion: Subspace-based defenses face fundamental limitations, underscoring the need for alternative strategies to preserve safety under continued training.

Abstract: Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.

[952] AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

Woosung Koh, Wonbeen Oh, Jaein Jang, MinHyung Lee, Hyeongjin Kim, Ah Yeon Kim, Joonkee Kim, Junghyun Lee, Taehyeon Kim, Se-Young Yun

Main category: cs.LG

TL;DR: AdaSTaR improves self-improving reasoning LMs by using adaptive sampling for diversity and curriculum, achieving better accuracy with 58.6% fewer training FLOPs.

DetailsMotivation: Current self-improving LMs suffer from trained observation imbalance due to random sampling, leading to inefficient training on solved examples while under-training on challenging ones.

Method: Introduces Adaptive STaR (AdaSTaR) with two principles: (1) Adaptive Sampling for Diversity to promote balanced training across observations, and (2) Adaptive Sampling for Curriculum to dynamically adjust data difficulty matching model’s evolving strength.

Result: Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% compared to baselines.

Conclusion: AdaSTaR enables more efficient and effective self-improving LMs, with improvements generalizing to different pre-trained LMs and larger models.

Abstract: Self-Taught Reasoners (STaR), synonymously known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in trained observation imbalance; inefficiently over-training on solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model’s evolving strength. Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.

[953] COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection

Jaewon Cheon, Pilsung Kang

Main category: cs.LG

TL;DR: The paper proposes COUNTDOWN methods for sparse activation in large language models, reducing FFNN computational costs by selectively deactivating non-essential parameters using linear combination insights.

DetailsMotivation: To address computational inefficiencies in large language models by developing more effective sparse activation methods that reduce inference costs while maintaining performance.

Method: Proposes two methods: M-COUNTDOWN (using indirect coefficients) and D-COUNTDOWN (using direct coefficients) based on the insight that FFNN layer sparsity lies in linear combinations over the down projection matrix.

Result: D-COUNTDOWN can omit 90% of computations with only 5.5% performance loss, while M-COUNTDOWN provides predictor-free solution with 29.4% better performance preservation than existing methods. Kernel implementations achieve real-world acceleration.

Conclusion: The COUNTDOWN methods effectively reduce computational costs in large language models through sparse activation, with D-COUNTDOWN achieving high computation reduction and M-COUNTDOWN offering superior performance preservation without predictors.

Abstract: The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation methods selectively deactivates non-essential parameters during inference, reducing computational costs in FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer lies globally in the form of a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation compared to existing methods. Our specialized kernel implementations effectively realize these theoretical gains into substantial real-world acceleration.

[954] How Many Parameters Does Your Task Really Need? Task Specific Pruning with LLM-Sieve

Waleed Reda, Abhinav Jangda, Krishna Chintalapudi

Main category: cs.LG

TL;DR: LLM-Sieve is a framework that prunes LLMs to minimal parameter subsets while preserving task performance, using output-aligned projections and adaptive genetic algorithm pruning.

DetailsMotivation: To determine how much of an LLM is truly necessary for specific tasks in resource-constrained settings, enabling more efficient deployment.

Method: Uses output-aligned non-orthogonal projections for better low-rank approximations than PCA/SVD, combined with adaptive pruning via Genetic Algorithm to discover matrix-specific pruning levels.

Result: Removes 20-75% of weights across models from 3.8B to 70B parameters with only 1-5% accuracy loss, outperforming prior pruning methods.

Conclusion: LLM-Sieve enables efficient deployment while revealing bottleneck matrices that concentrate critical knowledge, suggesting architectural implications for future LLM design.

Abstract: As Large Language Models (LLMs) are increasingly deployed for narrow tasks in resource-constrained settings, a central question arises: how much of an LLM is truly necessary for a given task? We present LLM-Sieve, a framework that prunes LLMs down to the minimal parameter subset needed to preserve task performance. Our approach introduces two innovations: (i) output-aligned non-orthogonal projections, which yield more faithful low-rank approximations than traditional PCA/SVD by aligning directly with layer outputs; and (ii) adaptive pruning via a Genetic Algorithm, which automatically discovers matrix-specific pruning levels and exposes the uneven distribution of task-relevant knowledge. Across models from 3.8B to 70B parameters, LLM-Sieve removes 20-75% of weights with only 1-5% accuracy loss-substantially ahead of prior pruning methods. Beyond efficiency, our framework reveals bottleneck matrices that concentrate critical knowledge, suggesting architectural implications for future LLM design. LLM-Sieve integrates seamlessly with LoRA fine-tuning and quantization, enabling both efficient deployment and deeper understanding of knowledge organization in LLMs.

[955] Tutorial on amortized optimization

Brandon Amos

Main category: cs.LG

TL;DR: This tutorial introduces amortized optimization methods that use learning to predict solutions for similar optimization problems, enabling much faster solving than traditional methods.

DetailsMotivation: Optimization is frequently applied to solve similar problem instances repeatedly. Amortized optimization exploits shared structure between instances to improve efficiency.

Method: Uses learning-based approaches to predict solutions by leveraging shared problem structure across similar instances, rather than solving each problem from scratch.

Result: Amortized optimization methods can solve problems orders of magnitude faster than traditional optimization methods without amortization.

Conclusion: Amortized optimization is a powerful approach with broad applications in variational inference, reinforcement learning, control, convex optimization, and other domains.

Abstract: Optimization is a ubiquitous modeling tool and is often deployed in settings which repeatedly solve similar instances of the same problem. Amortized optimization methods use learning to predict the solutions to problems in these settings, exploiting the shared structure between similar problem instances. These methods have been crucial in variational inference and reinforcement learning and are capable of solving optimization problems many orders of magnitudes faster than traditional optimization methods that do not use amortization. This tutorial presents an introduction to the amortized optimization foundations behind these advancements and overviews their applications in variational inference, sparse coding, gradient-based meta-learning, control, reinforcement learning, convex optimization, optimal transport, and deep equilibrium networks. The source code for this tutorial is available at https://github.com/facebookresearch/amortized-optimization-tutorial.

[956] Rethinking Exact Unlearning under Exposure: Extracting Forgotten Data under Exact Unlearning in Large Language Model

Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu

Main category: cs.LG

TL;DR: Exact unlearning, considered the gold standard for privacy protection, may paradoxically increase privacy risks when both pre- and post-unlearning models are accessible, as demonstrated by a novel data extraction attack that leverages model guidance.

DetailsMotivation: To address privacy concerns in LLMs trained on web data containing sensitive information, and to challenge the assumption that exact unlearning effectively mitigates privacy risks in practical deployment settings.

Method: A data extraction attack that uses signals from the pre-unlearning model to guide the post-unlearning model, combined with a token filtering strategy, to uncover patterns reflecting removed data distribution.

Result: The attack significantly improves extraction success rates, doubling performance in some cases across benchmarks (MUSE, TOFU, WMDP), and demonstrates effectiveness on simulated medical diagnosis data.

Conclusion: Unlearning may increase privacy leakage risks in real-world deployments, requiring evaluation methods to consider broader threat models including adversarial access to prior checkpoints.

Abstract: Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning – which retrains the model from scratch without the target data – is widely regarded the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits API are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates – doubling performance in some cases – across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack’s effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage during real-world deployments, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.

[957] On amortizing convex conjugates for optimal transport

Brandon Amos

Main category: cs.LG

TL;DR: The paper proposes using amortized optimization to compute the convex conjugate in Wasserstein-2 optimal transport, combining learned approximations with fine-tuning solvers to improve transport map quality.

DetailsMotivation: The convex conjugate in Euclidean Wasserstein-2 optimal transport is difficult to compute exactly in continuous space, limiting practical methods that cannot precisely conjugate dual potentials.

Method: Combines amortized optimization (learning a model to predict the conjugate) with a solver for fine-tuning to approximate the conjugate computation.

Result: Significantly improves quality of transport maps on the Wasserstein-2 benchmark and successfully models many 2-dimensional couplings and flows from literature.

Conclusion: The proposed approach effectively overcomes computational challenges in conjugate computation for optimal transport, with all methods and solvers made publicly available.

Abstract: This paper focuses on computing the convex conjugate (also known as the Legendre-Fenchel conjugate or c-transform) that appears in Euclidean Wasserstein-2 optimal transport. This conjugation is considered difficult to compute and in practice, methods are limited by not being able to exactly conjugate the dual potentials in continuous space. To overcome this, the computation of the conjugate can be approximated with amortized optimization, which learns a model to predict the conjugate. I show that combining amortized approximations to the conjugate with a solver for fine-tuning significantly improves the quality of transport maps learned for the Wasserstein-2 benchmark by Korotin et al. (2021a) and is able to model many 2-dimensional couplings and flows considered in the literature. All baselines, methods, and solvers are publicly available at http://github.com/facebookresearch/w2ot.

[958] Sum-of-Parts: Self-Attributing Neural Networks with End-to-End Learning of Feature Groups

Weiqiu You, Helen Qu, Marco Gatti, Bhuvnesh Jain, Eric Wong

Main category: cs.LG

TL;DR: SOP transforms any differentiable model into a group-based self-attributing neural network that achieves state-of-the-art performance while maintaining interpretability.

DetailsMotivation: Self-attributing neural networks face performance trade-offs, but group-based approaches can achieve zero error and high performance.

Method: Proposed Sum-of-Parts (SOP) framework that transforms differentiable models into group-based SANNs, learning feature groups end-to-end without supervision.

Result: Achieved state-of-the-art performance on vision and language tasks; groups were interpretable on quantitative and semantic metrics; useful for model debugging and scientific discovery.

Conclusion: SOP enables high-performance group-based SANNs that are interpretable and practically useful across domains.

Abstract: Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on errors of per-feature SANNs, whereas group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery. Our code is available at https://github.com/BrachioLab/sop

[959] Can We Infer Confidential Properties of Training Data from LLMs?

Pengrun Huang, Chhavi Yadav, Kamalika Chaudhuri, Ruihan Wu

Main category: cs.LG

TL;DR: PropInfer is a benchmark for evaluating property inference attacks on LLMs, showing they can reveal sensitive dataset properties after fine-tuning.

DetailsMotivation: LLMs are increasingly fine-tuned on sensitive domain datasets, but it's unclear if property inference attacks that work on other models also affect LLMs, potentially leaking confidential dataset properties.

Method: Created PropInfer benchmark using ChatDoctor dataset with various property types and task configurations. Proposed two attacks: prompt-based generation attack and shadow-model attack using word frequency signals.

Result: Empirical evaluations across multiple pretrained LLMs demonstrated successful property inference, revealing a previously unrecognized vulnerability in LLMs.

Conclusion: LLMs are vulnerable to property inference attacks that can reveal sensitive dataset-level properties after fine-tuning, highlighting a new security concern.

Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties – such as patient demographics or disease prevalence – that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.

[960] Tabular Data: Is Deep Learning all you need?

Guri Zabërgja, Arlind Kadra, Christian M. M. Frey, Josif Grabocka

Main category: cs.LG

TL;DR: Deep learning methods now outperform classical ML approaches on tabular data, according to a comprehensive benchmark of 17 state-of-the-art methods across 68 diverse datasets.

DetailsMotivation: Previous literature has shown gradient-boosted decision trees to be scalable and robust for tabular data, but recent deep learning models haven't been comprehensively evaluated under fair comparison conditions with classical approaches.

Method: Benchmarked seventeen state-of-the-art methods including neural networks, classical ML, and AutoML techniques across 68 diverse datasets from an established benchmark.

Result: Empirical results indicate a paradigm shift where deep learning methods now outperform classical approaches on tabular data.

Conclusion: There has been a paradigm shift in tabular data analysis, with deep learning methods demonstrating superior performance over traditional classical ML approaches.

Abstract: Tabular data represent one of the most prevalent data formats in applied machine learning, largely because they accommodate a broad spectrum of real-world problems. Existing literature has studied many of the shortcomings of neural architectures on tabular data and has repeatedly confirmed the scalability and robustness of gradient-boosted decision trees across varied datasets. However, recent deep learning models have not been subjected to a comprehensive evaluation under conditions that allow for a fair comparison with existing classical approaches. This situation motivates an investigation into whether recent deep-learning paradigms outperform classical ML methods on tabular data. Our survey fills this gap by benchmarking seventeen state-of-the-art methods, spanning neural networks, classical ML and AutoML techniques. Our empirical results over 68 diverse datasets from a well-established benchmark indicate a paradigm shift, where Deep Learning methods outperform classical approaches.

[961] PolyNet: Learning Diverse Solution Strategies for Neural Combinatorial Optimization

André Hottung, Mridul Mahajan, Kevin Tierney

Main category: cs.LG

TL;DR: PolyNet improves exploration in combinatorial optimization by learning complementary solution strategies without handcrafted diversity rules, outperforming explicit diversity enforcement approaches.

DetailsMotivation: Current RL methods for combinatorial optimization need better exploration during search. Existing approaches use handcrafted rules to enforce diverse solution generation, but these can impair solution quality and are hard to design for complex problems.

Method: PolyNet uses a single-decoder architecture with a training schema that learns complementary solution strategies without enforcing diversity through handcrafted rules.

Result: PolyNet finds better solutions than approaches that explicitly enforce diverse solution generation across four combinatorial optimization problems.

Conclusion: Implicit diversity mechanisms in PolyNet are more effective than explicit diversity enforcement for improving solution quality in combinatorial optimization.

Abstract: Reinforcement learning-based methods for constructing solutions to combinatorial optimization problems are rapidly approaching the performance of human-designed algorithms. To further narrow the gap, learning-based approaches must efficiently explore the solution space during the search process. Recent approaches artificially increase exploration by enforcing diverse solution generation through handcrafted rules, however, these rules can impair solution quality and are difficult to design for more complex problems. In this paper, we introduce PolyNet, an approach for improving exploration of the solution space by learning complementary solution strategies. In contrast to other works, PolyNet uses only a single-decoder and a training schema that does not enforce diverse solution generation through handcrafted rules. We evaluate PolyNet on four combinatorial optimization problems and observe that the implicit diversity mechanism allows PolyNet to find better solutions than approaches that explicitly enforce diverse solution generation.

[962] CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou

Main category: cs.LG

TL;DR: CE-GPPO is a novel RL algorithm that preserves gradients from clipped tokens in PPO to better manage policy entropy and improve exploration-exploitation balance in LLM training.

DetailsMotivation: Existing RL methods like PPO discard valuable gradient signals from low-probability tokens due to clipping, which negatively impacts entropy regulation and exploration-exploitation balance during LLM training.

Method: CE-GPPO reintroduces gradients from clipped tokens in PPO in a gentle, bounded manner by controlling gradient magnitude from tokens outside the clipping interval, enabling better entropy coordination.

Result: Extensive experiments on mathematical reasoning benchmarks show CE-GPPO consistently outperforms strong baselines across different model scales and effectively mitigates entropy instability.

Conclusion: CE-GPPO provides a theoretically justified and empirically effective approach to managing policy entropy in RL for LLMs, achieving superior performance through better gradient preservation and entropy regulation.

Abstract: Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose \textbf{C}oordinating \textbf{E}ntropy via \textbf{G}radient-\textbf{P}reserving \textbf{P}olicy \textbf{O}ptimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.

[963] Unified ODE Analysis of Smooth Q-Learning Algorithms

Donghwan Lee

Main category: cs.LG

TL;DR: This paper presents a unified convergence analysis for Q-learning and its smooth variants that improves upon the restrictive switching system approach.

DetailsMotivation: Previous convergence analysis using switching system framework had restrictive conditions like quasi-monotonicity, making it hard to generalize to other reinforcement learning algorithms like smooth Q-learning variants.

Method: The proposed analysis uses a more general ODE model that can cover both asynchronous Q-learning and its smooth versions, building on previous work using p-norm as Lyapunov function but with simpler frameworks.

Result: The analysis provides a unified convergence proof that works for both standard Q-learning and its smooth variants without the restrictive conditions of the switching system approach.

Conclusion: The proposed method offers a more general and unified convergence analysis framework that can handle Q-learning and its smooth variants with simpler mathematical frameworks.

Abstract: Convergence of Q-learning has been the focus of extensive research over the past several decades. Recently, an asymptotic convergence analysis for Q-learning was introduced using a switching system framework. This approach applies the so-called ordinary differential equation (ODE) approach to prove the convergence of the asynchronous Q-learning modeled as a continuous-time switching system, where notions from switching system theory are used to prove its asymptotic stability without using explicit Lyapunov arguments. However, to prove stability, restrictive conditions, such as quasi-monotonicity, must be satisfied for the underlying switching systems, which makes it hard to easily generalize the analysis method to other reinforcement learning algorithms, such as the smooth Q-learning variants. In this paper, we present a more general and unified convergence analysis that improves upon the switching system approach and can analyze Q-learning and its smooth variants. The proposed analysis is motivated by previous work on the convergence of synchronous Q-learning based on $p$-norm serving as a Lyapunov function. However, the proposed analysis addresses more general ODE models that can cover both asynchronous Q-learning and its smooth versions with simpler frameworks.

[964] RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?

Yiyou Sun, Yuhan Cao, Pohao Huang, Haoyue Bai, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song

Main category: cs.LG

TL;DR: DELTA-Code is a benchmark for evaluating whether LLMs can learn new algorithmic reasoning skills through RL and transfer them to out-of-distribution problems, revealing grokking phase transitions and transfer limitations.

DetailsMotivation: To determine if LLMs can genuinely acquire new reasoning strategies beyond their pre-trained capabilities, addressing the open question about their ability to learn and generalize novel algorithmic skills.

Method: Created DELTA-Code benchmark with synthetic coding problem families using templated generators, employed RL training with techniques like staged warm-up, experience replay, curriculum training, and verification-in-the-loop.

Result: Models show grokking phase transitions - sudden improvement after extended near-zero reward periods. Solid gains within families and recomposed skills, but persistent weaknesses in transformative generalization cases.

Conclusion: DELTA provides a clean testbed for probing RL-driven reasoning limits and understanding how models can acquire new algorithmic skills beyond existing priors, showing both learnability potential and transferability challenges.

Abstract: It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To attempt to answer this debate, we introduce DELTA-Code – Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding – a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability – can LLMs, through reinforcement learning (RL), solve problem families where pretrained models exhibit failure with large enough attempts (pass@K=0)? – and transferrability – if learnability happens, can such skills transfer systematically to out-of-distribution (OOD) test sets? Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability or generalization along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.

[965] Characteristic Learning for Provable One Step Generation

Zhao Ding, Chenguang Duan, Yuling Jiao, Ruoxuan Li, Jerry Zhijian Yang, Pingwen Zhang

Main category: cs.LG

TL;DR: The characteristic generator is a novel one-step generative model that combines GAN efficiency with flow model stability, using ODE-based probability transport along characteristics to create a single neural network map from Gaussian to target distribution.

DetailsMotivation: To develop a generative model that combines the fast sampling of GANs with the stable training of flow-based models, while addressing the curse of dimensionality through intrinsic dimension dependence.

Method: Estimate velocity field, solve probability flow ODE using Euler method to generate characteristics, then train a deep neural network to fit these characteristics as a one-step map from Gaussian to target distribution.

Result: Theoretical analysis shows non-asymptotic convergence rate in 2-Wasserstein distance that depends only on intrinsic data dimension, not ambient dimension. Experiments demonstrate high-quality sample generation with single network evaluation efficiency.

Conclusion: The characteristic generator successfully mitigates the curse of dimensionality and provides the first rigorous convergence analysis for flow-based one-step generative models, achieving efficient high-quality generation.

Abstract: We propose the characteristic generator, a novel one-step generative model that combines the efficiency of sampling in Generative Adversarial Networks (GANs) with the stable performance of flow-based models. Our model is driven by characteristics, along which the probability density transport can be described by ordinary differential equations (ODEs). Specifically, we first estimate the underlying velocity field and use the Euler method to solve the probability flow ODE, generating discrete approximations of the characteristics. A deep neural network is then trained to fit these characteristics, creating a one-step map that pushes a simple Gaussian distribution to the target distribution. In the theoretical aspect, we provide a comprehensive analysis of the errors arising from velocity matching, Euler discretization, and characteristic fitting to establish a non-asymptotic convergence rate in the 2-Wasserstein distance under mild data assumptions. Crucially, we demonstrate that under a standard manifold assumption, this convergence rate depends only on the intrinsic dimension of data rather than the much larger ambient dimension, proving our model’s ability to mitigate the curse of dimensionality. To our knowledge, this is the first rigorous convergence analysis for a flow-based one-step generative model. Experiments on both synthetic and real-world datasets demonstrate that the characteristic generator achieves high-quality and high-resolution sample generation with the efficiency of just a single neural network evaluation.

[966] Deep Learning without Weight Symmetry

Li Ji-An, Marcus K. Benna

Main category: cs.LG

TL;DR: The paper introduces Product Feedback Alignment (PFA), a biologically plausible alternative to backpropagation that eliminates explicit weight symmetry while maintaining comparable performance in deep networks.

DetailsMotivation: Backpropagation is biologically implausible due to weight transport and symmetry problems. Existing alternatives like feedback alignment still require bidirectional connections that contradict experimental brain observations.

Method: Proposed Product Feedback Alignment (PFA) algorithm that eliminates explicit weight symmetry entirely while approximating backpropagation’s performance.

Result: PFA achieves comparable performance to backpropagation in deep convolutional networks without requiring weight symmetry, making it more biologically plausible.

Conclusion: PFA offers a novel solution to the longstanding credit assignment problem in the brain, enabling more biologically plausible deep learning compared to previous methods.

Abstract: Backpropagation, a foundational algorithm for training artificial neural networks, predominates in contemporary deep learning. Although highly successful, it is widely considered biologically implausible, because it relies on precise symmetry between feedforward and feedback weights to accurately propagate gradient signals that assign credit. The so-called weight transport problem concerns how biological brains learn to align feedforward and feedback paths while avoiding the non-biological transport of feedforward weights into feedback weights. To address this, several credit assignment algorithms, such as feedback alignment and the Kollen-Pollack rule, have been proposed. While they can achieve the desired weight alignment, these algorithms imply that if a neuron sends a feedforward synapse to another neuron, it should also receive an identical or at least partially correlated feedback synapse from the latter neuron, thereby forming a bidirectional connection. However, this idealized connectivity pattern contradicts experimental observations in the brain, a discrepancy we refer to as the weight symmetry problem. To address this challenge posed by considering biological constraints on connectivity, we introduce the Product Feedback Alignment (PFA) algorithm. We demonstrate that PFA can eliminate explicit weight symmetry entirely while closely approximating backpropagation and achieving comparable performance in deep convolutional networks. Our results offer a novel approach to solve the longstanding problem of credit assignment in the brain, leading to more biologically plausible learning in deep networks compared to previous methods.

[967] Demystifying Higher-Order Graph Neural Networks

Maciej Besta, Florian Scheidl, Lukas Gianinazzi, Grzegorz Kwasniewski, Shachar Klaiman, Jürgen Müller, Torsten Hoefler

Main category: cs.LG

TL;DR: This paper provides a comprehensive taxonomy and blueprint for higher-order graph neural networks (HOGNNs) to address the challenge of analyzing and comparing diverse HOGNN models with different architectures and definitions of “higher-order.”

DetailsMotivation: The proliferation of diverse HOGNN models with different neural architectures and varying interpretations of "higher-order" makes it challenging to analyze, compare, and select appropriate models for specific scenarios.

Method: The authors design an in-depth taxonomy and blueprint for HOGNNs to facilitate model analysis and comparison, enabling the design of models that maximize performance.

Result: The taxonomy is used to analyze and compare available HOGNN models, resulting in insights for selecting the most beneficial GNN model in given scenarios, along with a comprehensive list of challenges and opportunities for future HOGNN research.

Conclusion: The proposed taxonomy and blueprint provide a systematic framework for understanding, comparing, and selecting HOGNN models, while identifying key research directions for developing more powerful higher-order graph neural networks.

Abstract: Higher-order graph neural networks (HOGNNs) and the related architectures from Topological Deep Learning are an important class of GNN models that harness polyadic relations between vertices beyond plain edges. They have been used to eliminate issues such as over-smoothing or over-squashing, to significantly enhance the accuracy of GNN predictions, to improve the expressiveness of GNN architectures, and for numerous other goals. A plethora of HOGNN models have been introduced, and they come with diverse neural architectures, and even with different notions of what the “higher-order” means. This richness makes it very challenging to appropriately analyze and compare HOGNN models, and to decide in what scenario to use specific ones. To alleviate this, we first design an in-depth taxonomy and a blueprint for HOGNNs. This facilitates designing models that maximize performance. Then, we use our taxonomy to analyze and compare the available HOGNN models. The outcomes of our analysis are synthesized in a set of insights that help to select the most beneficial GNN model in a given scenario, and a comprehensive list of challenges and opportunities for further research into more powerful HOGNNs.

[968] Learning to Reason as Action Abstractions with Scalable Mid-Training RL

Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, Zirui Wang

Main category: cs.LG

TL;DR: Mid-training with action abstractions improves RL performance by identifying compact action sets and enabling fast online RL selection, with RA3 algorithm showing significant gains in code generation tasks.

DetailsMotivation: Large language models benefit from reinforcement learning but require effective mid-training to identify useful action subsets and enable efficient online RL planning.

Method: Proposed Reasoning as Action Abstractions (RA3) - a mid-training algorithm that discovers temporally-consistent latent structures via RL and fine-tunes on bootstrapped data using sequential variational lower bound optimization.

Result: RA3 improves average performance on HumanEval and MBPP by 8 and 4 points over base models, achieves faster convergence and higher asymptotic performance in RLVR on multiple code generation benchmarks.

Conclusion: Mid-training is most effective with compact decision spaces and short horizons, emphasizing the importance of action abstractions over primitive actions for optimal RL performance.

Abstract: Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. An effective mid-training phase should identify a compact set of useful actions and enable fast selection among them through online RL. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error from pruning and the RL error during subsequent planning. Our analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and its impact on RL convergence, which governs the extent to which that policy can be improved via online interactions. These results suggest that mid-training is most effective when the decision space is compact and the effective horizon is short, highlighting the importance of operating in the space of action abstractions rather than primitive actions. Building on these insights, we propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a sequential variational lower bound and optimize it by iteratively discovering temporally-consistent latent structures via RL, followed by fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

[969] MALT: Improving Reasoning with Multi-Agent LLM Training

Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip H. S. Torr, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt

Main category: cs.LG

TL;DR: MALT is a multi-agent LLM training strategy that divides reasoning into generation, verification, and refinement steps using heterogeneous agents in a pipeline, improving performance on complex reasoning tasks without human supervision.

DetailsMotivation: Traditional LLMs use single chain-of-thought reasoning, which limits their ability to explore multiple reasoning paths and self-correct errors in complex tasks.

Method: Uses sequential pipeline of heterogeneous agents for generation, verification, and refinement. Creates multi-agent search tree through repeated sampling, applies value iteration to propagate rewards, and enables off-policy learning from both correct and incorrect trajectories.

Result: Achieved relative improvements of 15.66% on MATH, 7.42% on GSM8K, and 9.40% on CSQA compared to baseline LLM.

Conclusion: MALT represents an important advance in multi-agent cooperative training for LLMs, enabling specialized learning and improved end-to-end reasoning without human supervision.

Abstract: Large Language Models (LLMs) often produce answers with a single chain-of-thought, which restricts their ability to explore reasoning paths or self-correct flawed outputs in complex tasks. In this paper, we introduce MALT (Multi-Agent LLM Training), a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps using a sequential pipeline of heterogeneous agents. During data generation, each agent is repeatedly sampled to form a multi-agent search tree, where final outputs are graded against ground-truth data. We then apply value iteration to propagate reward signals back to each role-conditioned model, automatically producing multi-agent post-training data without human or teacher-model supervision. Our off-policy approach allows each agent to specialize by learning from correct and incorrect trajectories, ultimately improving the end-to-end reasoning chain. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40% respectively, making it an important advance towards multi-agent cooperative training.

[970] AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Chenyue Li, Wen Deng, Mengqian Lu, Binhang Yuan

Main category: cs.LG

TL;DR: AtmosSci-Bench is a new benchmark for evaluating LLMs in atmospheric science across five core categories using both multiple-choice and open-ended questions.

DetailsMotivation: To provide a robust evaluation framework for assessing LLM capabilities in atmospheric science, enabling better applications in climate services and scientific discovery.

Method: Created a dual-format benchmark with template-based MCQ generation featuring symbolic perturbation, and open-ended questions to test reasoning. Evaluated four categories of LLMs: instruction-tuned, advanced reasoning, math-augmented, and domain-specific climate models.

Result: Comprehensive evaluation revealed interesting insights into LLMs’ reasoning and problem-solving capabilities in atmospheric science.

Conclusion: AtmosSci-Bench serves as a critical step toward advancing LLM applications in climate services by providing a standard and rigorous evaluation framework.

Abstract: The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges and boosting scientific discovery in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate services by offering a standard and rigorous evaluation framework. Our source code is available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.

[971] Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading

Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, Hao Wang

Main category: cs.LG

TL;DR: FineMoE is a fine-grained expert offloading system for Mixture-of-Experts (MoE) LLM serving that reduces inference latency by 47% and improves expert hit rate by 39% over state-of-the-art solutions.

DetailsMotivation: MoE-based LLMs suffer from memory inefficiency due to sparsely activated experts. Existing offloading solutions either cause high inference latency or high memory footprints due to coarse-grained designs.

Method: FineMoE extracts fine-grained expert selection patterns from MoE models and semantic hints from input prompts to guide expert prefetching, caching, and offloading decisions. It’s prototyped on HuggingFace Transformers.

Result: Experiments show FineMoE reduces inference latency by 47% and improves expert hit rate by 39% compared to state-of-the-art solutions.

Conclusion: FineMoE effectively addresses the latency-memory trade-off in MoE serving by using fine-grained expert offloading with pattern extraction and semantic guidance.

Abstract: Large Language Models (LLMs) have gained immense success in revolutionizing various applications, including content generation, search and recommendation, and AI-assisted operation. To reduce high training costs, Mixture-of-Experts (MoE) architecture has become a popular backbone for modern LLMs. However, despite the benefits, serving MoE-based LLMs experience severe memory inefficiency due to sparsely activated experts. Recent studies propose to offload inactive experts from GPU memory to CPU memory to improve the serving efficiency of MoE models. However, they either incur high inference latency or high model memory footprints due to coarse-grained designs. To tame the latency-memory trade-off in MoE serving, we present FineMoE, a fine-grained expert offloading system for MoE serving that achieves low inference latency with memory efficiency. We design FineMoE to extract fine-grained expert selection patterns from MoE models and semantic hints from input prompts to efficiently guide expert prefetching, caching, and offloading decisions. FineMoE is prototyped on top of HuggingFace Transformers and deployed on a six-GPU testbed. Experiments with open-source MoE models and real-world workloads show that FineMoE reduces inference latency by 47% and improves expert hit rate by 39% over state-of-the-art solutions.

[972] QuIC: Quantum-Inspired Compound Adapters for Parameter Efficient Fine-Tuning

Snehal Raj, Brian Coyle

Main category: cs.LG

TL;DR: QuIC Adapters is a parameter-efficient fine-tuning method inspired by quantum circuits that uses less than 0.02% memory footprint while preserving pretrained representations through orthogonality constraints.

DetailsMotivation: Address GPU memory and training time constraints in scaling full fine-tuning of large foundation models by developing more efficient PEFT methods.

Method: Quantum-inspired compound adapters that enforce orthogonality in weight parameters, use Hamming-weight preserving circuits, and have native quantum deployment mechanisms. Combines multiple Hamming-weight orders with orthogonality and matrix compounding.

Result: First-order configuration matches existing orthogonal methods’ performance, while higher-order configurations achieve over 40x parameter compression compared to LoRA with modest performance trade-off. Successfully tested on LLaMA and vision transformers across language, math, reasoning and vision benchmarks.

Conclusion: QuIC adapters offer a promising direction for efficient fine-tuning in resource-constrained environments, enabling substantial parameter compression while maintaining performance.

Abstract: Scaling full finetuning of large foundation models strains GPU memory and training time. Parameter Efficient Fine-Tuning (PEFT) methods address this issue via adapter modules which update only a small subset of model parameters. In this work, we introduce Quantum-Inspired Compound Adapters (QuIC Adapters), a PEFT approach inspired from Hamming-weight preserving quantum circuits that can effectively finetune a model using less than 0.02% memory footprint of the base model. QuIC adapters preserve pretrained representations by enforcing orthogonality in weight parameters, and have native deployment mechanisms on quantum computers. We test QuIC adapters by finetuning large language models like LLaMA and vision transformers on language, math, reasoning and vision benchmarks. In its first-order configuration, QuIC recovers the performance of existing orthogonal methods, while higher-order configurations enable substantial parameter compression (over 40x smaller than LoRA) for a modest performance trade-off, unlocking applications in highly resource-constrained environments. Through ablation studies, we determine that combining multiple Hamming-weight orders with orthogonality and matrix compounding are essential for performant finetuning. Our findings suggest that QuIC adapters offers a promising direction for efficient finetuning of foundation models in resource-constrained environments.

[973] Diffusion Approximations for Thompson Sampling in the Small Gap Regime

Lin Fan, Peter W. Glynn

Main category: cs.LG

TL;DR: Analysis of Thompson sampling dynamics in small gap regime, showing convergence to stochastic differential equations and invariance principle for sampling-based bandit algorithms.

DetailsMotivation: To understand the process-level dynamics of Thompson sampling when gaps between arm means are small relative to time horizon, and to develop a general weak convergence theory for sampling-based bandit algorithms.

Method: Developed weak convergence theory using Continuous Mapping Theorem, analyzed Thompson sampling for stationary weakly dependent reward processes, and extended to various sampling-based algorithms including exponential family rewards and bootstrap-based methods.

Result: Process-level dynamics converge weakly to stochastic differential equations; invariance principle shows same weak limits for various sampling-based algorithms as Gaussian Thompson sampling; regret performance is insensitive to model mis-specification in small gap regime.

Conclusion: Thompson sampling and related sampling-based bandit algorithms exhibit universal behavior in small gap regime with convergence to common stochastic processes, and their performance is robust to model mis-specification.

Abstract: We study the process-level dynamics of Thompson sampling in the ``small gap’’ regime. The small gap regime is one in which the gaps between the arm means are of order $\sqrt{\gamma}$ or smaller and the time horizon is of order $1/\gamma$, where $\gamma$ is small. As $\gamma \downarrow 0$, we show that the process-level dynamics of Thompson sampling converge weakly to the solutions to certain stochastic differential equations and stochastic ordinary differential equations. Our weak convergence theory is developed from first principles using the Continuous Mapping Theorem, can handle stationary, weakly dependent reward processes, and can also be adapted to analyze a variety of sampling-based bandit algorithms. Indeed, we show that the process-level dynamics of many sampling-based bandit algorithms – including Thompson sampling designed for any single-parameter exponential family of rewards, as well as non-parametric bandit algorithms based on bootstrap re-sampling – satisfy an invariance principle. Namely, their weak limits coincide with that of Gaussian parametric Thompson sampling with Gaussian priors. Moreover, in the small gap regime, the regret performance of these algorithms is generally insensitive to model mis-specification, changing continuously with increasing degrees of mis-specification.

[974] TANTE: Time-Adaptive Operator Learning via Neural Taylor Expansion

Zhikai Wu, Sifan Wang, Shiyang Zhang, Sizhuang He, Min Zhu, Anran Jiao, Lu Lu, David van Dijk

Main category: cs.LG

TL;DR: TANTE is a time-adaptive operator learning framework for PDEs that uses neural Taylor expansion to predict future states with adaptive step sizes, achieving significant accuracy and efficiency improvements over fixed-step methods.

DetailsMotivation: Most existing operator learning methods for time-dependent PDEs use fixed time step sizes during rollout, which limits adaptability to varying temporal complexity and leads to error accumulation.

Method: TANTE predicts future states by performing Taylor expansion at current state, where neural networks learn higher-order temporal derivatives and local radius of convergence to dynamically adjust rollout step sizes.

Result: TANTE achieves superior accuracy and adaptability across various PDE benchmarks, with 60-80% accuracy gains and 30-40% speed-ups at inference time compared to fixed-step baselines.

Conclusion: The proposed time-adaptive framework with neural Taylor expansion effectively reduces cumulative error and improves computational efficiency for operator learning in time-dependent PDEs.

Abstract: Operator learning for time-dependent partial differential equations (PDEs) has seen rapid progress in recent years, enabling efficient approximation of complex spatiotemporal dynamics. However, most existing methods rely on fixed time step sizes during rollout, which limits their ability to adapt to varying temporal complexity and often leads to error accumulation. Here, we propose the Time-Adaptive Transformer with Neural Taylor Expansion (TANTE), a novel operator-learning framework that produces continuous-time predictions with adaptive step sizes. TANTE predicts future states by performing a Taylor expansion at the current state, where neural networks learn both the higher-order temporal derivatives and the local radius of convergence. This allows the model to dynamically adjust its rollout based on the local behavior of the solution, thereby reducing cumulative error and improving computational efficiency. We demonstrate the effectiveness of TANTE across a wide range of PDE benchmarks, achieving superior accuracy and adaptability compared to fixed-step baselines, delivering accuracy gains of 60-80 % and speed-ups of 30-40 % at inference time. The code is publicly available at https://github.com/zwu88/TANTE for transparency and reproducibility.

[975] Generalization of LiNGAM that allows confounding

Joe Suzuki, Tian-Le Yang

Main category: cs.LG

TL;DR: LiNGAM-MMI enhances LiNGAM by quantifying confounding using KL divergence and finding optimal variable order through shortest path formulation, improving accuracy with efficient computation.

DetailsMotivation: Traditional LiNGAM struggles with confounding effects, requiring high computational resources regardless of confounding presence and not guaranteeing detection of all confounding types.

Method: Introduces LiNGAM-MMI that measures confounding magnitude using KL divergence and arranges variables to minimize confounding impact through shortest path problem formulation for global optimization.

Result: LiNGAM-MMI achieves comparable efficiency to traditional LiNGAM without confounding while effectively handling confounding scenarios, more accurately determining correct variable order in both cases.

Conclusion: LiNGAM-MMI provides an improved approach that maintains efficiency in non-confounding cases while effectively addressing confounding through KL divergence-based optimization.

Abstract: LiNGAM determines the variable order from cause to effect using additive noise models, but it faces challenges with confounding. Previous methods maintained LiNGAM’s fundamental structure while trying to identify and address variables affected by confounding. As a result, these methods required significant computational resources regardless of the presence of confounding, and they did not ensure the detection of all confounding types. In contrast, this paper enhances LiNGAM by introducing LiNGAM-MMI, a method that quantifies the magnitude of confounding using KL divergence and arranges the variables to minimize its impact. This method efficiently achieves a globally optimal variable order through the shortest path problem formulation. LiNGAM-MMI processes data as efficiently as traditional LiNGAM in scenarios without confounding while effectively addressing confounding situations. Our experimental results suggest that LiNGAM-MMI more accurately determines the correct variable order, both in the presence and absence of confounding. The code is in the supplementary file in this link: https://github.com/SkyJoyTianle/ISIT2024.

[976] Position Paper: Assessing Robustness, Privacy, and Fairness in Federated Learning Integrated with Foundation Models

Jiaqi Wang, Xi Li

Main category: cs.LG

TL;DR: Integration of Foundation Models into Federated Learning addresses data scarcity and computational limitations but introduces new challenges in robustness, privacy, and fairness that require systematic evaluation and mitigation strategies.

DetailsMotivation: Federated Learning faces challenges with limited data availability and computational resource variability, while Foundation Models offer potential solutions through pre-training and data augmentation, but their integration creates new issues in robustness, privacy, and fairness that need investigation.

Method: Systematic evaluation of FM-FL integration implications across robustness, privacy, and fairness dimensions, analyzing trade-offs, identifying threats, and proposing criteria and strategies for addressing challenges.

Result: Identified novel issues introduced by FM-FL integration, uncovered specific threats, and developed a framework of criteria and strategies for navigating the challenges in robustness, privacy, and fairness.

Conclusion: The paper establishes foundational research directions for advancing FM-FL integration, providing a basis for developing reliable, secure, and equitable Federated Learning systems through systematic evaluation and mitigation strategies.

Abstract: Federated Learning (FL), while a breakthrough in decentralized machine learning, contends with significant challenges such as limited data availability and the variability of computational resources, which can stifle the performance and scalability of the models. The integration of Foundation Models (FMs) into FL presents a compelling solution to these issues, with the potential to enhance data richness and reduce computational demands through pre-training and data augmentation. However, this incorporation introduces novel issues in terms of robustness, privacy, and fairness, which have not been sufficiently addressed in the existing research. We make a preliminary investigation into this field by systematically evaluating the implications of FM-FL integration across these dimensions. We analyze the trade-offs involved, uncover the threats and issues introduced by this integration, and propose a set of criteria and strategies for navigating these challenges. Furthermore, we identify potential research directions for advancing this field, laying a foundation for future development in creating reliable, secure, and equitable FL systems.

[977] What is Intelligence? A Cycle Closure Perspective

Xin Li

Main category: cs.LG

TL;DR: Intelligence emerges from topological closure law (∂²=0) that creates persistent cycles/memory, enabling prediction through the Structure-before-Specificity principle and Memory-Amortized Inference mechanism.

DetailsMotivation: To provide a unified mathematical foundation for intelligence based on topological principles that explain how memory, prediction, and symbolic reasoning emerge from fundamental closure laws.

Method: Proposes a structural-dynamical framework using topological closure (∂²=0), Structure-before-Specificity principle, Context-Content Uncertainty Principle, and Memory-Amortized Inference with temporal and spatial bootstrapping.

Result: Shows that persistent cycles create memory invariants that enable prediction, explaining why semantics precedes syntax and providing an evolutionary trajectory from primitive memory to symbolic abstraction.

Conclusion: Intelligence arises from the progressive collapse of specificity into structure through closure-induced emergence of invariants, unifying natural intelligence from microbes to humans.

Abstract: What is intelligence? We argue for a structural-dynamical account rooted in a topological closure law: \emph{the boundary of a boundary vanishes} ($\partial^2=0$). This principle forces transient fragments to cancel while closed cycles persist as invariants, yielding the cascade $\partial^2!=!0 \Rightarrow \text{cycles (invariants)} \Rightarrow \text{memory} \Rightarrow \text{prediction (intelligence)}$. Prediction requires invariance: only order-invariant cycles can stabilize the predictive substrate. This motivates the \textbf{Structure-before-Specificity (SbS)} principle, where persistent structures ($\Phi$) must stabilize before contextual specificities ($\Psi$) can be meaningfully interpreted, and is formalized by the \textbf{Context-Content Uncertainty Principle (CCUP)}, which casts cognition as dynamic alignment that minimizes the joint uncertainty $H(\Phi,\Psi)$. We show that \textbf{Memory-Amortized Inference (MAI)} is the computational mechanism that implements SbS,$\rightarrow$,CCUP through dual bootstrapping: \emph{temporal} bootstrapping consolidates episodic specifics into reusable latent trajectories, while \emph{spatial} bootstrapping reuses these invariants across latent manifolds. This framework explains why \emph{semantics precedes syntax}: stable cycles anchor meaning, and symbolic syntax emerges only after semantic invariants are in place. In an evolutionary perspective, the same closure law unifies the trajectory of natural intelligence: from primitive memory traces in microbes, to cyclic sensorimotor patterns in bilaterians, to semantic generalization in mammals, culminating in human symbolic abstraction by natural language. In sum, intelligence arises from the progressive collapse of specificity into structure, grounded in the closure-induced emergence of invariants.

[978] Asynchronous Federated Stochastic Optimization for Heterogeneous Objectives Under Arbitrary Delays

Charikleia Iakovidou, Kibaek Kim

Main category: cs.LG

TL;DR: AREA is an asynchronous federated learning method that uses client-side memory to correct bias from uneven client participation without requiring sampling or prior knowledge of client latencies, achieving optimal convergence rates while being compatible with secure aggregation.

DetailsMotivation: To address the performance issues in federated learning caused by slow clients and the bias introduced by heterogeneous client response times under non-IID data, particularly in client-driven setups where sampling is infeasible.

Method: AREA is a stochastic (sub)gradient method that leverages asynchrony for scalability and uses client-side memory to correct bias from uneven participation. It communicates model residuals rather than gradient estimates and is compatible with secure aggregation.

Result: AREA achieves optimal convergence rates: O(1/K) in strongly convex, smooth regime and O(1/√K) in convex, nonsmooth regime. It accommodates larger step sizes than existing methods and demonstrates increased robustness to outliers by scaling with average client update frequency rather than min/max.

Conclusion: AREA effectively addresses the bias issues in asynchronous federated learning without requiring client sampling or prior knowledge of client latencies, achieving optimal convergence rates while maintaining compatibility with secure aggregation and enabling fast convergence without impacting model generalization.

Abstract: Federated learning (FL) was recently proposed to securely train models with data held over multiple locations (``clients’’) under the coordination of a central server. Prolonged training times caused by slow clients may hinder the performance of FL; while asynchronous communication is a promising solution, highly heterogeneous client response times under non-IID local data may introduce significant bias to the global model, particularly in client-driven setups where sampling is infeasible. To address this issue, we propose \underline{A}synch\underline{R}onous \underline{E}xact \underline{A}veraging (\textsc{AREA}), a stochastic (sub)gradient method that leverages asynchrony for scalability and uses client-side memory to correct the bias induced by uneven participation, without client sampling or prior knowledge of client latencies. \textsc{AREA} communicates model residuals rather than gradient estimates, reducing exposure to gradient inversion, and is compatible with secure aggregation. Under standard assumptions and unbounded, heterogeneous delays with finite mean, AREA achieves optimal convergence rates: $\mathcal{O}(1/K)$ in the strongly convex, smooth regime and $\mathcal{O}(1/\sqrt{K})$ in the convex, nonsmooth regime. For strongly convex, smooth objectives, we demonstrate theoretically and empirically that AREA accommodates larger step sizes than existing methods, enabling fast convergence without adversely impacting model generalization. In the convex, nonsmooth setting, to our knowledge we are the first to obtain rates that scale with the average client update frequency rather than the minimum or maximum, indicating increased robustness to outliers.

[979] Federated Continual Learning Goes Online: Uncertainty-Aware Memory Management for Vision Tasks and Beyond

Giuseppe Serra, Florian Buettner

Main category: cs.LG

TL;DR: Proposes an uncertainty-aware memory-based approach using Bregman Information to reduce catastrophic forgetting in Federated Continual Learning across different modalities in online streaming scenarios.

DetailsMotivation: Current FCL approaches are limited to vision tasks, require offline settings with stored datasets, and use generative methods that need multiple training epochs. Need solutions for online streaming data across different modalities.

Method: Uses Bregman Information-based estimator to compute model variance at sample level, measures predictive uncertainty to retrieve specific samples, and retrains model on selected samples to reduce forgetting.

Result: Demonstrates potential to reduce catastrophic forgetting in realistic settings while maintaining data confidentiality and competitive communication efficiency compared to state-of-the-art approaches.

Conclusion: The proposed uncertainty-aware memory-based approach effectively addresses catastrophic forgetting in online FCL scenarios across different modalities, offering practical advantages over existing methods.

Abstract: Given the ability to model more realistic and dynamic problems, Federated Continual Learning (FCL) has been increasingly investigated recently. A well-known problem encountered in this setting is the so-called catastrophic forgetting, for which the learning model is inclined to focus on more recent tasks while forgetting the previously learned knowledge. The majority of the current approaches in FCL propose generative-based solutions to solve said problem. However, this setting requires multiple training epochs over the data, implying an offline setting where datasets are stored locally and remain unchanged over time. Furthermore, the proposed solutions are tailored for vision tasks solely. To overcome these limitations, we propose a new approach to deal with different modalities in the online scenario where new data arrive in streams of mini-batches that can only be processed once. To solve catastrophic forgetting, we propose an uncertainty-aware memory-based approach. Specifically, we suggest using an estimator based on the Bregman Information (BI) to compute the model’s variance at the sample level. Through measures of predictive uncertainty, we retrieve samples with specific characteristics, and

  • by retraining the model on such samples - we demonstrate the potential of this approach to reduce the forgetting effect in realistic settings while maintaining data confidentiality and competitive communication efficiency compared to state-of-the-art approaches.

[980] Optimal Bound for PCA with Outliers using Higher-Degree Voronoi Diagrams

Sajjad Hashemian, Mohammad Saeed Arvenaghi, Ebrahim Ardeshir-Larijani

Main category: cs.LG

TL;DR: New PCA algorithms using computational geometry techniques (higher-degree Voronoi diagrams) and Grassmannian manifold sampling to handle outliers, achieving optimal solutions with improved time complexity.

DetailsMotivation: To develop robust PCA methods that can effectively handle outliers in datasets, which is a common challenge in real-world data analysis scenarios.

Method: Two approaches: (1) Using higher-degree Voronoi diagrams to navigate to optimal PCA subspace, (2) Randomized algorithm sampling subspaces from Grassmannian manifold with probability proportional to 2^{r(d-r)}.

Result: Achieved optimal PCA solution with time complexity n^{d+O(1)}poly(n,d) and randomized algorithm with complexity 2^{O(r(d-r))} × poly(n,d) with success probability (1-δ)^T.

Conclusion: The proposed methods using computational geometry and Grassmannian sampling provide clearer conceptual pathways and practical advantages for handling large datasets and higher-dimensional settings with outliers.

Abstract: In this paper, we introduce new algorithms for Principal Component Analysis (PCA) with outliers. Utilizing techniques from computational geometry, specifically higher-degree Voronoi diagrams, we navigate to the optimal subspace for PCA even in the presence of outliers. This approach achieves an optimal solution with a time complexity of $n^{d+\mathcal{O}(1)}\text{poly}(n,d)$. Additionally, we present a randomized algorithm with a complexity of $2^{\mathcal{O}(r(d-r))} \times \text{poly}(n, d)$. This algorithm samples subspaces characterized in terms of a Grassmannian manifold. By employing such sampling method, we ensure a high likelihood of capturing the optimal subspace, with the success probability $(1 - \delta)^T$. Where $\delta$ represents the probability that a sampled subspace does not contain the optimal solution, and $T$ is the number of subspaces sampled, proportional to $2^{r(d-r)}$. Our use of higher-degree Voronoi diagrams and Grassmannian based sampling offers a clearer conceptual pathway and practical advantages, particularly in handling large datasets or higher-dimensional settings.

[981] Temporal Source Recovery for Time-Series Source-Free Unsupervised Domain Adaptation

Yucheng Wang, Peiliang Gong, Min Wu, Felix Ott, Xiaoli Li, Lihua Xie, Zhenghua Chen

Main category: cs.LG

TL;DR: TemSR is a Source-Free UDA framework for time-series data that generates source-like domains and recovers temporal dependencies without accessing source data or requiring source-specific pretraining designs.

DetailsMotivation: Time-series data labeling is costly, and privacy concerns prevent source data access in UDA. Existing SFUDA methods struggle with transferring temporal dependencies in TS data, especially without source samples or impractical source-specific pretraining requirements.

Method: TemSR uses masking recovery optimization to generate source-like distributions with restored temporal dependencies, enhanced by local context-aware regularization and anchor-based recovery diversity maximization to preserve local dependencies and promote distributional diversity.

Result: Extensive experiments across multiple TS tasks show TemSR effectively recovers temporal dependencies and facilitates domain transfer, outperforming existing TS-SFUDA methods that require source-specific designs.

Conclusion: TemSR provides an effective and practical solution for TS-SFUDA by leveraging intrinsic TS properties to recover source temporal dependencies without source data access or specific pretraining requirements, enabling standard UDA techniques to work effectively.

Abstract: Time-Series (TS) data has grown in importance with the rise of Internet of Things devices like sensors, but its labeling remains costly and complex. While Unsupervised Domain Adaptation (UDAs) offers an effective solution, growing data privacy concerns have led to the development of Source-Free UDA (SFUDAs), enabling model adaptation to target domains without accessing source data. Despite their potential, applying existing SFUDAs to TS data is challenging due to the difficulty of transferring temporal dependencies, an essential characteristic of TS data, particularly in the absence of source samples. Although prior works attempt to address this by specific source pretraining designs, such requirements are often impractical, as source data owners cannot be expected to adhere to particular pretraining schemes. To address this, we propose Temporal Source Recovery (TemSR), a framework that leverages the intrinsic properties of TS data to generate a source-like domain and recover source temporal dependencies. With this domain, TemSR enables dependency transfer to the target domain without accessing source data or relying on source-specific designs, thereby facilitating effective and practical TS-SFUDA. TemSR features a masking recovery optimization process to generate a source-like distribution with restored temporal dependencies. This distribution is further refined through local context-aware regularization to preserve local dependencies, and anchor-based recovery diversity maximization to promote distributional diversity. Together, these components enable effective temporal dependency recovery and facilitate transfer across domains using standard UDA techniques. Extensive experiments across multiple TS tasks demonstrate the effectiveness of TemSR, which even surpasses existing TS-SFUDA methods that require source-specific designs.

[982] Learning-Augmented Robust Algorithmic Recourse

Kshitij Kayastha, Vasilis Gkatzelis, Shahin Jabbari

Main category: cs.LG

TL;DR: This paper introduces learning-augmented algorithmic recourse to handle model updates by using predictions of future models to reduce recourse costs while maintaining robustness.

DetailsMotivation: Machine learning models frequently update, making algorithmic recourse ineffective over time. Current robust recourse methods are too costly, so there's a need for approaches that balance cost-effectiveness with robustness to model changes.

Method: Proposed a novel learning-augmented algorithmic recourse framework that uses predictions of future model updates to optimize recourse costs while ensuring robustness when predictions are inaccurate.

Result: The study examines the trade-off between robustness and consistency, showing how prediction accuracy impacts performance in maintaining effective recourse across model updates.

Conclusion: Learning-augmented recourse provides a promising approach to handle model updates by leveraging predictions to reduce costs while maintaining robustness, with performance dependent on prediction accuracy.

Abstract: Algorithmic recourse provides individuals who receive undesirable outcomes from machine learning systems with minimum-cost improvements to achieve a desirable outcome. However, machine learning models often get updated, so the recourse may not lead to the desired outcome. The robust recourse framework chooses recourses that are less sensitive to adversarial model changes, but this comes at a higher cost. To address this, we initiate the study of learning-augmented algorithmic recourse and evaluate the extent to which a designer equipped with a prediction of the future model can reduce the cost of recourse when the prediction is accurate (consistency) while also limiting the cost even when the prediction is inaccurate (robustness). We propose a novel algorithm, study the robustness-consistency trade-off, and analyze how prediction accuracy affects performance.

[983] Low-Dimension-to-High-Dimension Generalization And Its Implications for Length Generalization

Yang Chen, Long Yang, Yitao Liang, Zhouchen Lin

Main category: cs.LG

TL;DR: The paper analyzes Low-Dimension-to-High-Dimension (LDHD) generalization as a special case of OOD generalization, showing it’s generally unattainable without proper inductive bias. The authors demonstrate this in Boolean functions and apply insights to explain CoT’s effectiveness and propose RPE-Square position embedding.

DetailsMotivation: To understand the scaling challenge in length generalization through LDHD generalization framework, where training data are restricted to low-dimensional subspaces while testing occurs in high-dimensional spaces.

Method: Theoretical analysis of LDHD generalization in Boolean functions, showing different architectures converge to min-degree interpolators. Applied insights to explain Chain-of-Thought (CoT) effectiveness and proposed RPE-Square position embedding design principle.

Result: LDHD generalization is achievable only when target function aligns with the inductive bias of min-degree interpolators. CoT works by changing latent space structure to enable better LDHD generalization. RPE-Square position embedding effectively handles both inherent LDHD generalization and data format nuisances.

Conclusion: LDHD generalization requires exploiting prior knowledge for appropriate inductive bias. The framework explains length generalization challenges and provides principled approaches for position embedding design to address both inherent scaling issues and data format problems.

Abstract: Low-Dimension-to-High-Dimension (LDHD) generalization is a special case of Out-of-Distribution (OOD) generalization, where the training data are restricted to a low-dimensional subspace of the high-dimensional testing space. Assuming that each instance is generated from a latent variable and the dimension of the latent variable reflects the problem scale, the inherent scaling challenge in length generalization can be captured by the LDHD generalization in the latent space. We theoretically demonstrate that LDHD generalization is generally unattainable without exploiting prior knowledge to provide appropriate inductive bias. Specifically, we explore LDHD generalization in Boolean functions. We verify that different architectures trained with (S)GD converge to \emph{min-degree interpolators w.r.t. different independent sets}. LDHD generalization is achievable if and only if the target function coincides with this inductive bias. Applying the insights from LDHD generalization to length generalization, we explain the effectiveness of CoT as changing the structure latent space to enable better LDHD generalization. We also propose a principle for position embedding design to handle both the inherent LDHD generalization and the nuisances such as the data format. Following the principle, we propose a novel position embedding called RPE-Square that remedies the RPE for dealing with the data format nuisance.

[984] Large EEG-U-Transformer for Time-Step Level Detection Without Pre-Training

Kerui Wu, Ziyue Zhao, Bülent Yener

Main category: cs.LG

TL;DR: A U-shaped model combining convolution and self-attention for EEG analysis that directly outputs time-step predictions, eliminating redundant processing and achieving state-of-the-art performance in seizure detection and other neurological tasks.

DetailsMotivation: Existing EEG detection methods face limitations: traditional models have limited parameters and modest performance, while foundation models require extensive pre-training. Both require complex post-processing to convert discrete labels to continuous annotations.

Method: Proposed a U-shaped model that captures both local and global EEG features using convolution and self-attentive modules for sequence-to-sequence modeling. The model directly outputs time-step level predictions and can extend to window-level classification with attention-pooling.

Result: Achieved promising efficiency improvement, cross-subject generalization, and state-of-the-art performance in various time-step and window-level classification tasks. Outperformed existing large foundation models using only downstream fine-tuning data. Won 1st place in the 2025 seizure detection challenge.

Conclusion: The proposed paradigm shift and model design demonstrate superior efficiency and performance in EEG analysis, eliminating redundant processing while maintaining scalability comparable to large foundation models without extensive pre-training requirements.

Abstract: Electroencephalography (EEG) reflects the brain’s functional state, making it a crucial tool for diverse detection applications like seizure detection and sleep stage classification. While deep learning-based approaches have recently shown promise for automated detection, traditional models are often constrained by limited learnable parameters and only achieve modest performance. In contrast, large foundation models showed improved capabilities by scaling up the model size, but required extensive time-consuming pre-training. Moreover, both types of existing methods require complex and redundant post-processing pipelines to convert discrete labels to continuous annotations. In this work, based on the multi-scale nature of EEG events, we propose a simple U-shaped model to efficiently learn representations by capturing both local and global features using convolution and self-attentive modules for sequence-to-sequence modeling. Compared to other window-level classification models, our method directly outputs predictions at the time-step level, eliminating redundant overlapping inferences. Beyond sequence-to-sequence modeling, the architecture naturally extends to window-level classification by incorporating an attention-pooling layer. Such a paradigm shift and model design demonstrated promising efficiency improvement, cross-subject generalization, and state-of-the-art performance in various time-step and window-level classification tasks in the experiment. More impressively, our model showed the capability to be scaled up to the same level as existing large foundation models that have been extensively pre-trained over diverse datasets and outperforms them by solely using the downstream fine-tuning dataset. Our model won 1st place in the 2025 “seizure detection challenge” organized in the International Conference on Artificial Intelligence in Epilepsy and Other Neurological Disorders.

[985] Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis

Yunwei Ren, Jason D. Lee

Main category: cs.LG

TL;DR: The paper shows that considering only the lowest degree in multi-index models leads to suboptimal rates, and proposes a method that uses both second- and higher-order terms to achieve better sample complexity.

DetailsMotivation: Traditional approaches using only the information exponent (lowest degree in Hermite expansion) miss key structural details in multi-index models, resulting in suboptimal sample complexity for online SGD.

Method: Proposes a two-phase approach: first learn the relevant subspace using second-order terms, then learn exact directions using higher-order terms in multi-index models.

Result: Achieves improved sample complexity of Õ(dP^{L-1}) for online SGD in multi-index models, compared to the traditional Õ(Pd^{L-1}) when considering only the lowest degree.

Conclusion: Considering both second- and higher-order terms in multi-index models enables more efficient learning with better sample complexity than approaches focusing solely on the information exponent.

Abstract: The information exponent ([BAGJ21]) and its extensions – which are equivalent to the lowest degree in the Hermite expansion of the link function (after a potential label transform) for Gaussian single-index models – have played an important role in predicting the sample complexity of online stochastic gradient descent (SGD) in various learning tasks. In this work, we demonstrate that, for multi-index models, focusing solely on the lowest degree can miss key structural details of the model and result in suboptimal rates. Specifically, we consider the task of learning target functions of form $f_(\mathbf{x}) = \sum_{k=1}^{P} \phi(\mathbf{v}_k^ \cdot \mathbf{x})$, where $P \ll d$, the ground-truth directions ${ \mathbf{v}k^* }{k=1}^P$ are orthonormal, and the information exponent of $\phi$ is $L$. Based on the theory of information exponent, when $L = 2$, only the relevant subspace (not the exact directions) can be recovered due to the rotational invariance of the second-order terms, and when $L > 2$, recovering the directions using online SGD require $\tilde{O}(P d^{L-1})$ samples. In this work, we show that by considering both second- and higher-order terms, we can first learn the relevant space using the second-order terms, and then the exact directions using the higher-order terms, and the overall sample and complexity of online SGD is $\tilde{O}( d P^{L-1} )$.

[986] Comparative Performance of Collaborative Bandit Algorithms: Effect of Sparsity and Exploration Intensity

Eren Ozbay, Ashkan Golgoon

Main category: cs.LG

TL;DR: This paper analyzes collaborative bandit algorithms that improve contextual bandits by sharing information between related arms/items, addressing cold start problems through hard and soft clustering approaches.

DetailsMotivation: To improve contextual bandit performance by enabling collaboration between arms/items, which allows feedback sharing across related entities and alleviates cold user/item problems where new entities lack historical interaction data.

Method: Focuses on soft clustering approaches that allow fuzzy assignment of relationships between arms, and conducts extensive experiments on state-of-the-art collaborative contextual bandit algorithms, examining the effects of sparsity and exploration intensity.

Result: Numerical experiments show that controlling sparsity in collaboration improves data efficiency and performance by better informing learning. Increased exploration intensity acts as a correction mechanism that reduces variance from potentially misspecified relationships among users.

Conclusion: Collaborative bandits with controlled sparsity and appropriate exploration intensity significantly improve performance, and misspecification issues can be further remedied by introducing latent factors to increase the dimensionality of bandit parameters.

Abstract: This paper offers a comprehensive analysis of collaborative bandit algorithms and provides a thorough comparison of their performance. Collaborative bandits aim to improve the performance of contextual bandits by introducing relationships between arms (or items), allowing effective propagation of information. Collaboration among arms allows the feedback obtained through a single user (item) to be shared across related users (items). Introducing collaboration also alleviates the cold user (item) problem, i.e., lack of historical information when a new user (item) arriving to the platform with no prior record of interactions. In the context of modeling the relationships between arms (items), there are two main approaches: Hard and soft clustering. We call approaches that model the relationship between arms in an \textit{absolute} manner as hard clustering, i.e., the relationship is binary. Soft clustering relaxes membership constraints, allowing \textit{fuzzy} assignment. Focusing on the latter, we provide extensive experiments on the state-of-the-art collaborative contextual bandit algorithms and investigate the effect of sparsity and how the exploration intensity acts as a correction mechanism. Our numerical experiments demonstrate that controlling for sparsity in collaboration improves data efficiency and performance as it better informs learning. Meanwhile, increasing the exploration intensity acts as a correction because it effectively reduces variance due to potentially misspecified relationships among users. We observe that this misspecification is further remedied by introducing latent factors, and thus, increasing the dimensionality of the bandit parameters.

[987] What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale

Xiaoyong Yuan, Xiaolong Ma, Linke Guo, Lan Zhang

Main category: cs.LG

TL;DR: PAIA is a model-centric framework for auditing fine-tuned diffusion models to detect specific target concepts without needing optimized prompts or generated images, achieving over 90% accuracy and 18-40X speedup.

DetailsMotivation: Address growing ethical and legal concerns about fine-tuned diffusion models generating sensitive or unauthorized content, as current methods lack practical pre-deployment auditing tools.

Method: Prompt-Agnostic Image-Free Auditing (PAIA) analyzes internal model behavior directly, bypassing prompt optimization and image generation steps used in traditional approaches.

Result: Achieved over 90% detection accuracy on 320 controlled models and 771 real-world community models, reducing auditing time by 18-40X compared to baselines.

Conclusion: PAIA provides the first scalable and practical solution for pre-deployment concept auditing of diffusion models, enabling safer and more transparent model sharing.

Abstract: Diffusion models (DMs) have revolutionized text-to-image generation, enabling the creation of highly realistic and customized images from text prompts. With the rise of parameter-efficient fine-tuning (PEFT) techniques, users can now customize powerful pre-trained models using minimal computational resources. However, the widespread sharing of fine-tuned DMs on open platforms raises growing ethical and legal concerns, as these models may inadvertently or deliberately generate sensitive or unauthorized content. Despite increasing regulatory attention on generative AI, there are currently no practical tools for systematically auditing these models before deployment. In this paper, we address the problem of concept auditing: determining whether a fine-tuned DM has learned to generate a specific target concept. Existing approaches typically rely on prompt-based input crafting and output-based image classification but they suffer from critical limitations, including prompt uncertainty, concept drift, and poor scalability. To overcome these challenges, we introduce Prompt-Agnostic Image-Free Auditing (PAIA), a novel, model-centric concept auditing framework. By treating the DM as the object of inspection, PAIA enables direct analysis of internal model behavior, bypassing the need for optimized prompts or generated images. We evaluate PAIA on 320 controlled models trained with curated concept datasets and 771 real-world community models sourced from a public DM sharing platform. Evaluation results show that PAIA achieves over 90% detection accuracy while reducing auditing time by 18 - 40X compared to existing baselines. To our knowledge, PAIA is the first scalable and practical solution for pre-deployment concept auditing of diffusion models, providing a practical foundation for safer and more transparent diffusion model sharing.

[988] The Persistence of Neural Collapse Despite Low-Rank Bias

Connall Garrod, Jonathan P. Keating

Main category: cs.LG

TL;DR: Deep neural collapse (DNC) is suboptimal under cross-entropy loss due to low-rank bias from L2 regularization, with fixed bound on non-negligible singular values at global minima as depth increases.

DetailsMotivation: To extend previous theoretical work showing DNC suboptimality under MSE loss to cross-entropy loss, and understand why DNC appears frequently despite being suboptimal.

Method: Analyzed deep unconstrained feature models (UFMs) with cross-entropy loss, characterized low-rank bias, studied loss surface geometry, and validated with experiments in deep UFMs and neural networks.

Result: High-rank structures like DNC are not generally optimal; proved fixed bound on non-negligible singular values at global minima; DNC is more prevalent in loss landscape than other critical configurations.

Conclusion: DNC’s frequent empirical appearance is explained by its prevalence in the loss landscape, not by optimality, as it’s suboptimal due to low-rank bias from regularization.

Abstract: Neural collapse (NC) and its multi-layer variant, deep neural collapse (DNC), describe a structured geometry that occurs in the features and weights of trained deep networks. Recent theoretical work by Sukenik et al. using a deep unconstrained feature model (UFM) suggests that DNC is suboptimal under mean squared error (MSE) loss. They heuristically argue that this is due to low-rank bias induced by L2 regularization. In this work, we extend this result to deep UFMs trained with cross-entropy loss, showing that high-rank structures, including DNC, are not generally optimal. We characterize the associated low-rank bias, proving a fixed bound on the number of non-negligible singular values at global minima as network depth increases. We further analyze the loss surface, demonstrating that DNC is more prevalent in the landscape than other critical configurations, which we argue explains its frequent empirical appearance. Our results are validated through experiments in deep UFMs and deep neural networks.

[989] VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning

Xuyang Chen, Guojian Wang, Keyu Yan, Lin Zhao

Main category: cs.LG

TL;DR: VIPO is a model-based offline RL algorithm that uses self-supervised feedback from value estimation to improve model training by minimizing inconsistency between data-based and model-based value estimates.

DetailsMotivation: Model-based offline RL methods often introduce unreliable conservatism due to heuristic uncertainty estimation, which can limit performance. There's a need for more systematic approaches to enhance model accuracy.

Method: VIPO incorporates self-supervised feedback by learning the model while minimizing inconsistency between value estimates from offline data and those estimated from the model.

Result: VIPO learns highly accurate models efficiently and achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks, consistently outperforming existing methods.

Conclusion: VIPO provides a general framework that can be integrated into existing model-based offline RL algorithms to systematically enhance model accuracy through value-based feedback.

Abstract: Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we introduce VIPO, a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. Specifically, the model is learned by additionally minimizing the inconsistency between the value learned directly from the offline data and the one estimated from the model. We perform comprehensive evaluations from multiple perspectives to show that VIPO can learn a highly accurate model efficiently and consistently outperform existing methods. In particular, it achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks. Overall, VIPO offers a general framework that can be readily integrated into existing model-based offline RL algorithms to systematically enhance model accuracy.

[990] Joint Diffusion models in Continual Learning

Paweł Skierś, Kamil Deja

Main category: cs.LG

TL;DR: JDCL is a new continual learning method using joint diffusion models with generative rehearsal that prevents catastrophic forgetting by combining classifier and generative model in one network.

DetailsMotivation: Neural networks suffer from catastrophic forgetting when trained on new data distributions, and existing generative-replay methods need improvement.

Method: Jointly optimize a classifier and diffusion-based generative model in a single network with knowledge distillation for stable adaptation to new tasks.

Result: Outperforms state-of-the-art generative replay techniques on benchmarks and excels in semi-supervised continual learning compared to buffer-based methods.

Conclusion: Joint diffusion models with shared parametrization and knowledge distillation effectively prevent catastrophic forgetting in continual learning scenarios.

Abstract: In this work, we introduce JDCL - a new method for continual learning with generative rehearsal based on joint diffusion models. Neural networks suffer from catastrophic forgetting defined as abrupt loss in the model’s performance when retrained with additional data coming from a different distribution. Generative-replay-based continual learning methods try to mitigate this issue by retraining a model with a combination of new and rehearsal data sampled from a generative model. In this work, we propose to extend this idea by combining a continually trained classifier with a diffusion-based generative model into a single - jointly optimized neural network. We show that such shared parametrization, combined with the knowledge distillation technique allows for stable adaptation to new tasks without catastrophic forgetting. We evaluate our approach on several benchmarks, where it outperforms recent state-of-the-art generative replay techniques. Additionally, we extend our method to the semi-supervised continual learning setup, where it outperforms competing buffer-based replay techniques, and evaluate, in a self-supervised manner, the quality of trained representations.

[991] Recursive Deep Inverse Reinforcement Learning

Paul Ghanem, Owen Howell, Michael Potter, Pau Closas, Alireza Ramezani, Deniz Erdogmus, Tales Imbiriba

Main category: cs.LG

TL;DR: Proposes RDIRL, an online recursive deep inverse reinforcement learning method that uses second-order Newton updates for fast convergence in recovering adversary cost functions, outperforming existing IRL algorithms.

DetailsMotivation: Existing deep IRL methods are offline, require large batch sizes, and use first-order updates, limiting real-time applicability in domains like cybersecurity and strategy games where inferring adversary goals is crucial.

Method: Minimizes an upper bound on the Guided Cost Learning objective using sequential second-order Newton updates similar to Extended Kalman Filter, enabling online recursive learning.

Result: RDIRL successfully recovers cost and reward functions of expert agents in standard and adversarial benchmark tasks, outperforming several leading IRL algorithms.

Conclusion: The proposed online recursive approach with second-order updates provides a fast and effective method for real-time adversary goal inference in non-cooperative multi-agent systems.

Abstract: Inferring an adversary’s goals from exhibited behavior is crucial for counterplanning and non-cooperative multi-agent systems in domains like cybersecurity, military, and strategy games. Deep Inverse Reinforcement Learning (IRL) methods based on maximum entropy principles show promise in recovering adversaries’ goals but are typically offline, require large batch sizes with gradient descent, and rely on first-order updates, limiting their applicability in real-time scenarios. We propose an online Recursive Deep Inverse Reinforcement Learning (RDIRL) approach to recover the cost function governing the adversary actions and goals. Specifically, we minimize an upper bound on the standard Guided Cost Learning (GCL) objective using sequential second-order Newton updates, akin to the Extended Kalman Filter (EKF), leading to a fast (in terms of convergence) learning algorithm. We demonstrate that RDIRL is able to recover cost and reward functions of expert agents in standard and adversarial benchmark tasks. Experiments on benchmark tasks show that our proposed approach outperforms several leading IRL algorithms.

[992] Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Ambiguous Prompts and Unanswerable Questions

Hazel Kim, Tom A. Lamb, Adel Bibi, Philip Torr, Yarin Gal

Main category: cs.LG

TL;DR: A novel test-time approach detects LLM hallucinations by analyzing information flow across model layers, revealing that hallucinations manifest as information deficiencies in inter-layer transmissions.

DetailsMotivation: LLMs frequently generate confident but inaccurate responses, posing risks in safety-critical domains, especially when processing inputs with ambiguous or insufficient context.

Method: Systematic analysis of information flow across model layers, tracking cross-layer information dynamics (LI) that accounts for both information gain and loss during computation.

Result: Hallucination manifests as usable information deficiencies in inter-layer transmissions, and LI provides robust indicators of model reliability without requiring additional training.

Conclusion: LI integrates easily with pretrained LLMs without architectural modifications, offering a practical approach to detect hallucinations by monitoring cross-layer information dynamics.

Abstract: Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel, test-time approach to detecting model hallucination through systematic analysis of information flow across model layers. We target cases when LLMs process inputs with ambiguous or insufficient context. Our investigation reveals that hallucination manifests as usable information deficiencies in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics ($\mathcal{L}$I) provides robust indicators of model reliability, accounting for both information gain and loss during computation. $\mathcal{L}$I integrates easily with pretrained LLMs without requiring additional training or architectural modifications.

[993] How to build a consistency model: Learning flow maps via self-distillation

Nicholas M. Boffi, Michael S. Albergo, Eric Vanden-Eijnden

Main category: cs.LG

TL;DR: The paper presents a unified framework for learning flow maps (consistency models) that eliminates the need for pre-trained teachers by converting distillation schemes into direct training algorithms via self-distillation.

DetailsMotivation: Flow-based generative models have excellent sample quality but require expensive differential equation solving at inference time. Flow map models promise better efficiency but lack unified training methods.

Method: Three algorithmic families based on mathematical characterizations: Eulerian, Lagrangian, and Progressive methods. Lagrangian methods avoid spatial derivatives and small-step bootstrapping, enabling more stable training.

Result: Lagrangian methods achieve significantly more stable training and higher performance than standard Eulerian and Progressive schemes. The framework unifies existing training schemes and reveals new design principles.

Conclusion: The methodology provides a systematic framework for directly learning flow maps, with Lagrangian methods emerging as superior for stable training and performance in accelerated generative modeling.

Abstract: Flow-based generative models achieve state-of-the-art sample quality, but require the expensive solution of a differential equation at inference time. Flow map models, commonly known as consistency models, encompass many recent efforts to improve inference-time efficiency by learning the solution operator of this differential equation. Yet despite their promise, these models lack a unified description that clearly explains how to learn them efficiently in practice. Here, building on the methodology proposed in Boffi et. al. (2024), we present a systematic algorithmic framework for directly learning the flow map associated with a flow or diffusion model. By exploiting a relationship between the velocity field underlying a continuous-time flow and the instantaneous rate of change of the flow map, we show how to convert any distillation scheme into a direct training algorithm via self-distillation, eliminating the need for pre-trained teachers. We introduce three algorithmic families based on different mathematical characterizations of the flow map: Eulerian, Lagrangian, and Progressive methods, which we show encompass and extend all known distillation and direct training schemes for consistency models. We find that the novel class of Lagrangian methods, which avoid both spatial derivatives and bootstrapping from small steps by design, achieve significantly more stable training and higher performance than more standard Eulerian and Progressive schemes. Our methodology unifies existing training schemes under a single common framework and reveals new design principles for accelerated generative modeling. Associated code is available at https://github.com/nmboffi/flow-maps.

[994] SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation

Aecheon Jung, Seunghwan Lee, Dongyoon Han, Sungeun Hong

Main category: cs.LG

TL;DR: SyMerge is a lightweight model merging framework that optimizes task-specific layers and merging coefficients to achieve task synergy, outperforming previous methods across vision, dense prediction, and NLP benchmarks.

DetailsMotivation: Most prior model merging approaches focus on avoiding task interference, but the real potential lies in achieving synergy where tasks enhance one another. A pilot study showed that cross-task performance strongly predicts merge quality and adapting single task-specific layers can substantially improve compatibility.

Method: SyMerge jointly optimizes one task-specific layer and merging coefficients using a robust self-labeling strategy guided by expert model predictions, avoiding entropy-based adaptation pitfalls. This minimalist design ensures stability without requiring labels.

Result: SyMerge achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks. The adapted layers also transfer effectively to other merging methods.

Conclusion: The framework demonstrates that principled, lightweight optimization of task-specific layers enables effective model merging with task synergy, outperforming interference-focused approaches while maintaining transferability to other methods.

Abstract: Model merging offers an efficient alternative to multi-task learning by combining independently fine-tuned models, but most prior approaches focus mainly on avoiding task interference. We argue instead that the real potential of merging lies in achieving synergy, where tasks enhance one another. Our intuition comes from a pilot study showing that when a classifier trained on one task is paired with the encoder of another, the resulting cross-task performance strongly predicts merge quality. Moreover, adapting even a single task-specific layer can substantially improve this compatibility, suggesting a simple yet powerful lever for synergy. Building on this insight, we introduce SyMerge, a lightweight framework that jointly optimizes one task-specific layer and merging coefficients. To ensure stability without labels, SyMerge employs a robust self-labeling strategy guided by expert model predictions, avoiding the pitfalls of entropy-based adaptation. This minimalist yet principled design achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks, while also producing adapted layers that transfer effectively to other merging methods. Our code is available at https://aim-skku.github.io/SyMerge/

[995] Chameleon2++: An Efficient and Scalable Variant Of Chameleon Clustering

Priyanshu Singh, Kapil Ahuja

Main category: cs.LG

TL;DR: Chameleon2++ improves hierarchical clustering scalability by reducing time complexity from O(n²) to O(n log n) through approximate k-NN search, multi-level graph partitioning, and efficient merging, achieving 4% better clustering quality.

DetailsMotivation: Traditional hierarchical clustering algorithms fail to scale effectively for large datasets due to O(n²) complexity, and recent Chameleon variants still suffer from this limitation.

Method: Three key improvements: 1) Approximate k-NN search using Annoy algorithm for faster graph generation, 2) Multi-level partitioning with hMETIS instead of recursive bisection, 3) Retaining flood fill heuristic for balanced component merging.

Result: Reduces overall time complexity to O(n log n) and achieves 4% average improvement in clustering quality on real-world benchmark datasets compared to prior Chameleon works.

Conclusion: Algorithmic efficiency and clustering quality can co-exist in large-scale hierarchical clustering through the proposed enhancements.

Abstract: Hierarchical clustering remains a fundamental challenge in data mining, particularly when dealing with large-scale datasets where traditional approaches fail to scale effectively. Recent Chameleon-based algorithms - Chameleon2, M-Chameleon, and INNGS-Chameleon have proposed advanced strategies but they still suffer from $O(n^2)$ computational complexity, especially for large datasets. With Chameleon2 as the base algorithm, we introduce Chameleon2++ that addresses this challenge. Our algorithm has three parts. First, Graph Generation - we propose an approximate $k$-NN search instead of an exact one, specifically we integrate with the Annoy algorithm. This results in fast approximate nearest neighbor computation, significantly reducing the graph generation time. Second, Graph Partitioning - we propose use of a multi-level partitioning algorithm instead of a recursive bisection one. Specifically we adapt the hMETIS algorithm instead of the FM. This is because multi-level algorithms are robust to approximation introduced in the graph generation phase yielding higher-quality partitions, and that too with minimum configuration requirements. Third, Merging - we retain the flood fill heuristic that ensures connected balanced components in the partitions as well as efficient partition merging criteria leading to the final clusters. These enhancements reduce the overall time complexity to $O(n\log n)$, achieving scalability. On real-world benchmark datasets used in prior Chameleon works, Chameleon2++ delivers an average of 4% improvement in clustering quality. This demonstrates that algorithmic efficiency and clustering quality can co-exist in large-scale hierarchical clustering.

[996] Cooperative Decentralized Backdoor Attacks on Vertical Federated Learning

Seohyun Lee, Wenzhi Fang, Anindya Bijoy Das, Seyyedali Hosseinalipour, David J. Love, Christopher G. Brinton

Main category: cs.LG

TL;DR: A novel backdoor attack on vertical federated learning that doesn’t require server gradient information, uses label inference with variational autoencoders and metric learning, and enables collusion among multiple adversaries for coordinated poisoning.

DetailsMotivation: Backdoor attacks are well-studied in horizontal FL but less understood in vertical FL (VFL) where devices hold different features and only the server has labels. Existing approaches often rely on server gradient information.

Method: Adversaries train label inference models locally using variational autoencoders with metric learning. They coordinate through consensus process on graph topology to select datapoints for poisoning, with trigger splitting across adversaries and intensity-based implantation.

Result: The attack achieves significantly higher success rates than recent VFL backdoor approaches while maintaining main task performance, despite not using server information. Collusion among adversaries improves attack performance.

Conclusion: The proposed attack demonstrates vulnerabilities in VFL systems even without server gradient access, highlighting the importance of considering collusion and coordinated poisoning in VFL security.

Abstract: Federated learning (FL) is vulnerable to backdoor attacks, where adversaries alter model behavior on target classification labels by embedding triggers into data samples. While these attacks have received considerable attention in horizontal FL, they are less understood for vertical FL (VFL), where devices hold different features of the samples, and only the server holds the labels. In this work, we propose a novel backdoor attack on VFL which (i) does not rely on gradient information from the server and (ii) considers potential collusion among multiple adversaries for sample selection and trigger embedding. Our label inference model augments variational autoencoders with metric learning, which adversaries can train locally. A consensus process over the adversary graph topology determines which datapoints to poison. We further propose methods for trigger splitting across the adversaries, with an intensity-based implantation scheme skewing the server towards the trigger. Our convergence analysis reveals the impact of backdoor perturbations on VFL indicated by a stationarity gap for the trained model, which we verify empirically as well. We conduct experiments comparing our attack with recent backdoor VFL approaches, finding that ours obtains significantly higher success rates for the same main task performance despite not using server information. Additionally, our results verify the impact of collusion on attack performance.

[997] Learning to Bid in Non-Stationary Repeated First-Price Auctions

Zihao Hu, Xiaoyu Fan, Yuan Yao, Jiheng Zhang, Zhengyuan Zhou

Main category: cs.LG

TL;DR: This paper addresses optimal bidding strategies in first-price auctions, focusing on dynamic regret minimization against non-stationary opponents using novel regularity metrics and Optimistic Mirror Descent.

DetailsMotivation: First-price auctions are increasingly used in digital advertising, but optimal bidding is complex. Existing approaches use static benchmarks that perform poorly in non-stationary environments, creating a need for dynamic benchmarks.

Method: The authors introduce two metrics to quantify opponent bid regularity, use Optimistic Mirror Descent with novel optimism configuration, and provide minimax-optimal dynamic regret characterization for bid sequences satisfying regularity constraints.

Result: The method achieves minimax-optimal dynamic regret rates and outperforms existing approaches on synthetic datasets, validating theoretical guarantees.

Conclusion: The proposed framework effectively handles non-stationarity in first-price auctions through dynamic regret minimization and novel regularity metrics, providing superior performance compared to static benchmark approaches.

Abstract: First-price auctions have recently gained significant traction in digital advertising markets, exemplified by Google’s transition from second-price to first-price auctions. Unlike in second-price auctions, where bidding one’s private valuation is a dominant strategy, determining an optimal bidding strategy in first-price auctions is more complex. From a learning perspective, the learner (a specific bidder) can interact with the environment (other bidders, i.e., opponents) sequentially to infer their behaviors. Existing research often assumes specific environmental conditions and benchmarks performance against the best fixed policy (static benchmark). While this approach ensures strong learning guarantees, the static benchmark can deviate significantly from the optimal strategy in environments with even mild non-stationarity. To address such scenarios, a dynamic benchmark–representing the sum of the highest achievable rewards at each time step–offers a more suitable objective. However, achieving no-regret learning with respect to the dynamic benchmark requires additional constraints. By inspecting reward functions in online first-price auctions, we introduce two metrics to quantify the regularity of the sequence of opponents’ highest bids, which serve as measures of non-stationarity. We provide a minimax-optimal characterization of the dynamic regret for the class of sequences of opponents’ highest bids that satisfy either of these regularity constraints. Our main technical tool is the Optimistic Mirror Descent (OMD) framework with a novel optimism configuration, which is well-suited for achieving minimax-optimal dynamic regret rates in this context. We then use synthetic datasets to validate our theoretical guarantees and demonstrate that our methods outperform existing ones.

[998] Computing Exact Shapley Values in Polynomial Time for Product-Kernel Methods

Majid Mohammadi, Siu Lun Chau, Krikamol Muandet

Main category: cs.LG

TL;DR: PKeX-Shapley enables exact computation of Shapley values for product-kernel models in polynomial time, overcoming the intractability of existing approximation methods.

DetailsMotivation: Kernel methods lack interpretability, limiting their use in high-stakes applications. Current Shapley value approximations incur unavoidable errors, creating a need for exact computation methods.

Method: Uses multiplicative structure of product kernels with a new functional baseline value function that removes feature influence by setting functional components to least informative state, enabling recursive polynomial-time computation.

Result: Provides exact Shapley value computation in polynomial time for product-kernel models, extending beyond predictive modeling to kernel-based statistical inference measures like MMD and HSIC.

Conclusion: PKeX-Shapley offers a framework for exact, interpretable feature attribution in kernel methods and kernel-based statistical inference, addressing key limitations of existing approximation approaches.

Abstract: Kernel methods are widely used in machine learning due to their flexibility and expressiveness. However, their black-box nature poses significant challenges to interpretability, limiting their adoption in high-stakes applications. Shapley value-based feature attribution techniques, such as SHAP and kernel method-specific adaptation like RKHS-SHAP, offer a promising path toward explainability. Yet, computing exact Shapley values is generally intractable, leading existing methods to rely on approximations and thereby incur unavoidable error. In this work, we introduce PKeX-Shapley, a novel algorithm that utilizes the multiplicative structure of product kernels to enable the exact computation of Shapley values in polynomial time. The core of our approach is a new value function, the functional baseline value function, specifically designed for product-kernel models. This value function removes the influence of a feature subset by setting its functional component to the least informative state. Crucially, it allows a recursive thus efficient computation of Shapley values in polynomial time. As an important additional contribution, we show that our framework extends beyond predictive modeling to statistical inference. In particular, it generalizes to popular kernel-based discrepancy measures such as the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC), thereby providing new tools for interpretable statistical inference.

[999] Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies

Akansha Kalra, Basavasagar Patil, Guanhong Tao, Daniel S. Brown

Main category: cs.LG

TL;DR: This paper presents the first systematic study of adversarial attacks on Learning from Demonstration (LfD) algorithms, revealing their high vulnerability to universal perturbations that are transferable across algorithms and tasks.

DetailsMotivation: While LfD algorithms show promise in robotic manipulation, their vulnerability to offline universal perturbation attacks remains underexplored, raising security concerns.

Method: Comprehensive study of adversarial attacks on multiple LfD algorithms including Behavior Cloning, LSTM-GMM, Implicit Behavior Cloning, Diffusion Policy, and VQ-BET, testing both white-box and black-box attacks on simulated robotic manipulation tasks.

Result: Most current LfD methods are highly vulnerable to adversarial perturbations, and these attacks are often transferable across algorithms, architectures, and tasks, creating significant security vulnerabilities.

Conclusion: The findings highlight critical vulnerabilities in modern behavior cloning algorithms and pave the way for future work to address these security limitations in LfD systems.

Abstract: Learning from Demonstration (LfD) algorithms have shown promising results in robotic manipulation tasks, but their vulnerability to offline universal perturbation attacks remains underexplored. This paper presents a comprehensive study of adversarial attacks on both classic and recently proposed algorithms, including Behavior Cloning (BC), LSTM-GMM, Implicit Behavior Cloning (IBC), Diffusion Policy (DP), and Vector-Quantizied Behavior Transformer (VQ-BET). We study the vulnerability of these methods to universal adversarial perturbations. Our experiments on several simulated robotic manipulation tasks reveal that most of the current methods are highly vulnerable to adversarial perturbations. We also show that these attacks are often transferable across algorithms, architectures, and tasks, raising concerning security vulnerabilities to black-box attacks. To the best of our knowledge, we are the first to present a systematic study of the vulnerabilities of different LfD algorithms to both white-box and black-box attacks. Our findings highlight the vulnerabilities of modern BC algorithms, paving the way for future work in addressing such limitations.

[1000] Advancing Brainwave Modeling with a Codebook-Based Foundation Model

Konstantinos Barmpas, Na Lee, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou

Main category: cs.LG

TL;DR: LaBraM++ is an enhanced large brainwave foundation model that improves EEG representation through better architectural design, achieving superior performance across BCI tasks while maintaining training efficiency.

DetailsMotivation: Existing pre-trained EEG models struggle to fully capture neural oscillation information due to suboptimal architectural designs, limiting their performance and generalizability in BCI applications.

Method: Introduces LaBraM++ with principled improvements based on robust signal processing foundations to enhance representational capacity of neural oscillations.

Result: Demonstrates substantial performance gains across various tasks, consistently outperforming its original architecture and achieving competitive results compared to other open-source large brainwave foundation models.

Conclusion: LaBraM++ shows superior performance and training efficiency, making it a strong foundation for future advancements in large brainwave foundation models.

Abstract: Recent advances in large-scale pre-trained Electroencephalogram (EEG) models have shown great promise, driving progress in Brain-Computer Interfaces (BCIs) and healthcare applications. However, despite their success, many existing pre-trained models have struggled to fully capture the rich information content of neural oscillations, a limitation that fundamentally constrains their performance and generalizability across diverse BCI tasks. This limitation is frequently rooted in suboptimal architectural design choices which constrain their representational capacity. In this work, we introduce LaBraM++, an enhanced Large Brainwave Foundation Model (LBM) that incorporates principled improvements grounded in robust signal processing foundations. LaBraM++ demonstrates substantial gains across a variety of tasks, consistently outperforming its originally-based architecture and achieving competitive results when compared to other open-source LBMs. Its superior performance and training efficiency highlight its potential as a strong foundation for future advancements in LBMs.

[1001] From Restless to Contextual: A Thresholding Bandit Reformulation For Finite-horizon Performance

Jiamin Xu, Ivan Nazarov, Aditya Rastogi, África Periáñez, Kyra Gan

Main category: cs.LG

TL;DR: The paper introduces a reformulation of online restless bandits as budgeted thresholding contextual bandits to improve finite-horizon performance by reducing sample complexity and achieving faster convergence.

DetailsMotivation: Existing online restless bandit algorithms suffer from poor finite-horizon performance due to prohibitive sample complexity of learning full MDPs for each agent. Superior finite-horizon performance requires rapid convergence to high-quality policies.

Method: Reformulate online restless bandits as budgeted thresholding contextual bandits, encoding long-term state transitions into scalar rewards. Propose a practical learning policy for heterogeneous-agent, multi-state settings.

Result: Achieves sublinear regret with faster convergence than existing methods, leading to higher cumulative reward. Empirical validation shows significant gains over state-of-the-art algorithms in large-scale heterogeneous environments.

Conclusion: Provides a new pathway for achieving practical, sample-efficient learning in finite-horizon restless bandits by simplifying the learning problem and enabling rapid convergence to high-quality policies.

Abstract: This paper addresses the poor finite-horizon performance of existing online \emph{restless bandit} (RB) algorithms, which stems from the prohibitive sample complexity of learning a full \emph{Markov decision process} (MDP) for each agent. We argue that superior finite-horizon performance requires \emph{rapid convergence} to a \emph{high-quality} policy. Thus motivated, we introduce a reformulation of online RBs as a \emph{budgeted thresholding contextual bandit}, which simplifies the learning problem by encoding long-term state transitions into a scalar reward. We prove the first non-asymptotic optimality of an oracle policy for a simplified finite-horizon setting. We propose a practical learning policy under a heterogeneous-agent, multi-state setting, and show that it achieves a sublinear regret, achieving \emph{faster convergence} than existing methods. This directly translates to higher cumulative reward, as empirically validated by significant gains over state-of-the-art algorithms in large-scale heterogeneous environments. Our work provides a new pathway for achieving practical, sample-efficient learning in finite-horizon RBs.

[1002] Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Mudit Gaur, Prashant Trivedi, Sasidhar Kunapuli, Amrit Singh Bedi, Vaneet Aggarwal

Main category: cs.LG

TL;DR: The paper presents a theoretical framework for discrete-state diffusion models, achieving a state-of-the-art sample complexity bound of O(ε^-4) and decomposing score estimation error into statistical and optimization components.

DetailsMotivation: Discrete-state diffusion models are crucial for text, sequences, and combinatorial structures but remain theoretically understudied compared to continuous-state models, with existing analyses relying on unrealistic assumptions about empirical risk minimizers.

Method: The authors develop a principled theoretical framework that provides a structured decomposition of score estimation error into statistical and optimization components, addressing the fundamental gap in discrete-state diffusion model analysis.

Result: The framework achieves a sample complexity bound of O(ε^-4), which is state-of-the-art, and offers critical insights into efficient training of diffusion models by analyzing the error components systematically.

Conclusion: This work establishes the theoretical tractability and practical relevance of diffusion models by providing rigorous analysis of discrete-state models, filling a fundamental gap in the literature and enabling more efficient training approaches.

Abstract: Diffusion models have demonstrated remarkable performance in generating high-dimensional samples across domains such as vision, language, and the sciences. Although continuous-state diffusion models have been extensively studied both empirically and theoretically, discrete-state diffusion models, essential for applications involving text, sequences, and combinatorial structures, they remain significantly less understood from a theoretical standpoint. In particular, all existing analyses of discrete-state models assume access to an empirical risk minimizer. In this work, we present a principled theoretical framework analyzing diffusion models, providing a state-of-the-art sample complexity bound of $\widetilde{\mathcal{O}}(\epsilon^{-4})$. Our structured decomposition of the score estimation error into statistical and optimization components offers critical insights into how diffusion models can be trained efficiently. This analysis addresses a fundamental gap in the literature and establishes the theoretical tractability and practical relevance of diffusion models.

[1003] Mamba base PKD for efficient knowledge compression

José Medina, Amnir Hadachi, Paul Honeine, Abdelaziz Bensrhair

Main category: cs.LG

TL;DR: Integrating Mamba Architecture with Progressive Knowledge Distillation to reduce DNN complexity while maintaining accuracy in image classification.

DetailsMotivation: Address the challenge of deploying large DNNs in resource-constrained environments by reducing model size and computational complexity without sacrificing accuracy.

Method: Progressive Knowledge Distillation framework that distills large teacher models into smaller student models using Mamba blocks with Selective-State-Space Models (S-SSM) to focus on important input aspects.

Result: On MNIST: Student group retained 63% of teacher’s FLOPs with 98% accuracy; weak student used only 1% FLOPs with 72% accuracy. On CIFAR-10: Students achieved 1% less accuracy than teacher, with small student using 5% FLOPs to achieve 50% accuracy.

Conclusion: Mamba Architecture successfully integrates with PKD to create efficient student models, providing a scalable solution for deploying complex neural networks in real-time applications with reduced computational costs.

Abstract: Deep neural networks (DNNs) have remarkably succeeded in various image processing tasks. However, their large size and computational complexity present significant challenges for deploying them in resource-constrained environments. This paper presents an innovative approach for integrating Mamba Architecture within a Progressive Knowledge Distillation (PKD) process to address the challenge of reducing model complexity while maintaining accuracy in image classification tasks. The proposed framework distills a large teacher model into progressively smaller student models, designed using Mamba blocks. Each student model is trained using Selective-State-Space Models (S-SSM) within the Mamba blocks, focusing on important input aspects while reducing computational complexity. The work’s preliminary experiments use MNIST and CIFAR-10 as datasets to demonstrate the effectiveness of this approach. For MNIST, the teacher model achieves 98% accuracy. A set of seven student models as a group retained 63% of the teacher’s FLOPs, approximating the teacher’s performance with 98% accuracy. The weak student used only 1% of the teacher’s FLOPs and maintained 72% accuracy. Similarly, for CIFAR-10, the students achieved 1% less accuracy compared to the teacher, with the small student retaining 5% of the teacher’s FLOPs to achieve 50% accuracy. These results confirm the flexibility and scalability of Mamba Architecture, which can be integrated into PKD, succeeding in the process of finding students as weak learners. The framework provides a solution for deploying complex neural networks in real-time applications with a reduction in computational cost.

[1004] InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions

Liangjian Wen, Qun Dai, Jianzhuang Liu, Jiangtao Zheng, Yong Dai, Dongkai Wang, Zhao Kang, Jun Wang, Zenglin Xu, Jiang Duan

Main category: cs.LG

TL;DR: InfMasking is a contrastive synergistic information extraction method that uses infinite masking to enhance multimodal representation learning by capturing richer synergistic interactions between modalities.

DetailsMotivation: Existing multimodal methods struggle to capture the full spectrum of synergistic information between modalities, which is crucial for creating unique outcomes that no single modality can achieve alone.

Method: InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create varied synergistic patterns. It aligns unmasked fused representations with masked ones through mutual information maximization using a derived InfMasking loss.

Result: InfMasking effectively enhances synergistic information between modalities and achieves state-of-the-art performance across seven benchmarks on large-scale real-world datasets.

Conclusion: The infinite masking strategy enables capturing richer synergistic interactions by exposing models to diverse partial modality combinations, addressing the fundamental challenge in multimodal representation learning.

Abstract: In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.

[1005] Interpretable Visualizations of Data Spaces for Classification Problems

Christian Jorgensen, Arthur Y. Lin, Rhushil Vasavada, Rose K. Cersonsky

Main category: cs.LG

TL;DR: A hybrid supervised-unsupervised technique for visualizing decision boundaries in classification models, demonstrated on chemical neurotoxicity data.

DetailsMotivation: Current visualization techniques make it difficult to understand how classification models perceive data boundaries, despite their success in delineating behaviors.

Method: Proposed a hybrid supervised-unsupervised technique specifically designed for visualizing decision boundaries determined by classification problems.

Result: The method provides human-interpretable maps that can be analyzed both qualitatively and quantitatively, successfully demonstrated through visualizing decision boundaries for chemical neurotoxicity.

Conclusion: While developed in the context of chemistry problems, this visualization method can be generalized across subfields to “unbox” the operations of machine-learning classification models.

Abstract: How do classification models “see” our data? Based on their success in delineating behaviors, there must be some lens through which it is easy to see the boundary between classes; however, our current set of visualization techniques makes this prospect difficult. In this work, we propose a hybrid supervised-unsupervised technique distinctly suited to visualizing the decision boundaries determined by classification problems. This method provides a human-interpretable map that can be analyzed qualitatively and quantitatively, which we demonstrate through visualizing and interpreting a decision boundary for chemical neurotoxicity. While we discuss this method in the context of chemistry-driven problems, its application can be generalized across subfields for “unboxing” the operations of machine-learning classification models.

[1006] Behavior Injection: Preparing Language Models for Reinforcement Learning

Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, Ding Zhao

Main category: cs.LG

TL;DR: The paper analyzes why RL finetuning works inconsistently on LLMs and proposes behavior injection to make models more RL-ready before RL training.

DetailsMotivation: RL finetuning shows inconsistent results on LLMs - some models improve significantly while others plateau or degrade, which needs investigation.

Method: Propose behavior injection - a task-agnostic data augmentation scheme that enriches SFT data with exploratory and exploitative behaviors before RL training.

Result: Evaluation across two reasoning benchmarks shows the method significantly increases performance gains from RL over pre-RL models.

Conclusion: Behavior injection effectively prepares LLMs for RL finetuning by addressing key conditions for successful post-training.

Abstract: Reinforcement learning (RL) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RL finetuning: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RL over the pre-RL model.

[1007] Dependency-aware Maximum Likelihood Estimation for Active Learning

Beyza Kalkanli, Tales Imbiriba, Stratis Ioannidis, Deniz Erdogmus, Jennifer Dy

Main category: cs.LG

TL;DR: DMLE corrects MLE in active learning by addressing sample dependencies typically neglected due to i.i.d. assumption, achieving superior performance across multiple benchmark datasets.

DetailsMotivation: Active learning creates dependencies among samples in the labeled set due to sequential selection, but conventional MLE assumes i.i.d. data, overlooking these dependencies during model parameter estimation.

Method: Proposed Dependency-aware MLE (DMLE) that corrects MLE within active learning framework by addressing sample dependencies to ensure consistency with active learning principles.

Result: Achieved superior performance across multiple benchmark datasets, with average accuracy improvements of 6%, 8.6%, and 10.5% for k=1, k=5, and k=10 respectively after collecting first 100 samples.

Conclusion: DMLE effectively addresses sample dependencies in active learning, outperforming conventional MLE and reaching higher performance in earlier cycles.

Abstract: Active learning aims to efficiently build a labeled training set by strategically selecting samples to query labels from annotators. In this sequential process, each sample acquisition influences subsequent selections, causing dependencies among samples in the labeled set. However, these dependencies are overlooked during the model parameter estimation stage when updating the model using Maximum Likelihood Estimation (MLE), a conventional method that assumes independent and identically distributed (i.i.d.) data. We propose Dependency-aware MLE (DMLE), which corrects MLE within the active learning framework by addressing sample dependencies typically neglected due to the i.i.d. assumption, ensuring consistency with active learning principles in the model parameter estimation process. This improved method achieves superior performance across multiple benchmark datasets, reaching higher performance in earlier cycles compared to conventional MLE. Specifically, we observe average accuracy improvements of 6%, 8.6%, and 10.5% for k=1, k=5, and k=10 respectively, after collecting the first 100 samples, where entropy is the acquisition function and k is the query batch size acquired at every active learning cycle. Our implementation is publicly available at: https://github.com/neu-spiral/DMLEforAL

[1008] MoESD: Unveil Speculative Decoding’s Potential for Accelerating Sparse MoE

Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang

Main category: cs.LG

TL;DR: Speculative decoding (SD) provides greater acceleration for Mixture of Experts (MoE) models than dense models, especially at medium batch sizes and with sparser MoE architectures.

DetailsMotivation: While speculative decoding is known to accelerate dense LLMs, its effectiveness for MoE models was unclear. The authors discovered that MoE models surprisingly benefit more from SD than dense models, particularly as MoE designs become sparser.

Method: Developed a theoretical modeling framework to analyze SD tradeoffs, introduced a new ’target efficiency’ metric to characterize system bottlenecks, and conducted experiments on different GPUs with MoE models like Qwen2-57B-A14B.

Result: Achieved up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes, with broader effective batch size ranges for sparser MoE architectures. The new target efficiency metric helps identify system bottlenecks more comprehensively.

Conclusion: This work reveals a new perspective for accelerating MoE inference in private serving scenarios where existing solutions struggle, showing that SD is particularly effective for MoE models and becomes more beneficial as MoE architectures become sparser.

Abstract: Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser – the prevailing trend in MoE designs – the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand tradeoffs involved in SD, we develop a reliable modeling based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric ’target efficiency’ that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.

[1009] Exact and Linear Convergence for Federated Learning under Arbitrary Client Participation is Attainable

Bicheng Ying, Zhe Li, Haibo Yang

Main category: cs.LG

TL;DR: FOCUS is a novel FL algorithm that achieves exact linear convergence despite arbitrary client participation and data heterogeneity, using stochastic matrix modeling and push-pull strategy.

DetailsMotivation: Address fundamental FL challenges: arbitrary client participation and data heterogeneity that cause FedAvg-style algorithms to struggle with exact convergence and slow rates.

Method: Introduce stochastic matrix and time-varying graphs to model client participation dynamics, then design FOCUS algorithm using decentralized perspective with push-pull strategy.

Result: Rigorous proof shows FOCUS achieves exact convergence with linear rate regardless of arbitrary client participation - first work to demonstrate this result.

Conclusion: FOCUS effectively overcomes FL challenges and establishes new benchmark for convergence guarantees in practical FL settings.

Abstract: This work tackles the fundamental challenges in Federated Learning (FL) posed by arbitrary client participation and data heterogeneity, prevalent characteristics in practical FL settings. It is well-established that popular FedAvg-style algorithms struggle with exact convergence and can suffer from slow convergence rates since a decaying learning rate is required to mitigate these scenarios. To address these issues, we introduce the concept of stochastic matrix and the corresponding time-varying graphs as a novel modeling tool to accurately capture the dynamics of arbitrary client participation and the local update procedure. Leveraging this approach, we offer a fresh decentralized perspective on designing FL algorithms and present FOCUS, Federated Optimization with Exact Convergence via Push-pull Strategy, a provably convergent algorithm designed to effectively overcome the previously mentioned two challenges. More specifically, we provide a rigorous proof demonstrating that FOCUS achieves exact convergence with a linear rate regardless of the arbitrary client participation, establishing it as the first work to demonstrate this significant result.

[1010] Learning and Transferring Physical Models through Derivatives

Alessandro Trenta, Andrea Cossu, Davide Bacciu

Main category: cs.LG

TL;DR: DERL is a supervised approach that learns physical systems by modeling their partial derivatives, with theoretical guarantees of consistency with physical laws and the ability to incrementally build models through knowledge distillation.

DetailsMotivation: To develop a method that can learn physical systems while maintaining consistency with underlying physical laws, and enable incremental model building through knowledge transfer across different portions of the physical domain and parameter ranges.

Method: Proposes Derivative Learning (DERL) which models physical systems by learning partial derivatives, includes a distillation protocol for transferring knowledge from pre-trained models to student models, and provides theoretical guarantees for learning true physical systems even with empirical derivatives.

Result: DERL outperforms state-of-the-art methods in generalizing ODEs to unseen initial conditions and parametric PDEs to unseen parameters. Successfully demonstrates knowledge transfer across models for new physical domains and PDE parameter ranges.

Conclusion: DERL represents the first attempt at building physical models incrementally in multiple stages, providing a framework for consistent physical learning and effective knowledge transfer across different modeling scenarios.

Abstract: We propose Derivative Learning (DERL), a supervised approach that models physical systems by learning their partial derivatives. We also leverage DERL to build physical models incrementally, by designing a distillation protocol that effectively transfers knowledge from a pre-trained model to a student one. We provide theoretical guarantees that DERL can learn the true physical system, being consistent with the underlying physical laws, even when using empirical derivatives. DERL outperforms state-of-the-art methods in generalizing an ODE to unseen initial conditions and a parametric PDE to unseen parameters. We also design a method based on DERL to transfer physical knowledge across models by extending them to new portions of the physical domain and a new range of PDE parameters. We believe this is the first attempt at building physical models incrementally in multiple stages.

[1011] Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach

Xuyang Chen, Keyu Yan, Wenhan Cao, Lin Zhao

Main category: cs.LG

TL;DR: ADAC is an offline RL method that uses advantage-based modulation to selectively evaluate OOD actions, addressing distribution shift while enabling better generalization than conservative approaches.

DetailsMotivation: Offline RL suffers from distribution shift causing inaccurate evaluation and overestimation of OOD actions. Existing conservative methods limit generalization by discouraging all OOD actions.

Method: Uses advantage-like function to evaluate OOD actions and modulate Q-function update discriminatively. Leverages the insight that state value function is learned more reliably than action-value function, using next-state value to assess actions.

Result: Developed PointMaze environment to visualize advantage modulation selecting superior OOD actions while discouraging inferior ones. Achieved state-of-the-art performance on D4RL benchmark with strong gains on challenging tasks.

Conclusion: ADAC effectively addresses distribution shift in offline RL through discriminative advantage modulation, enabling better generalization than conservative approaches while maintaining strong performance.

Abstract: Offline reinforcement learning (RL) learns policies from fixed datasets without online interactions, but suffers from distribution shift, causing inaccurate evaluation and overestimation of out-of-distribution (OOD) actions. Existing methods counter this by conservatively discouraging all OOD actions, which limits generalization. We propose Advantage-based Diffusion Actor-Critic (ADAC), which evaluates OOD actions via an advantage-like function and uses it to modulate the Q-function update discriminatively. Our key insight is that the (state) value function is generally learned more reliably than the action-value function; we thus use the next-state value to indirectly assess each action. We develop a PointMaze environment to clearly visualize that advantage modulation effectively selects superior OOD actions while discouraging inferior ones. Moreover, extensive experiments on the D4RL benchmark show that ADAC achieves state-of-the-art performance, with especially strong gains on challenging tasks.

[1012] Learning Penalty for Optimal Partitioning via Automatic Feature Extraction

Tung L Nguyen, Toby Hocking

Main category: cs.LG

TL;DR: A novel method using recurrent networks to learn penalty parameters for changepoint detection, outperforming traditional feature-based approaches on genomic datasets.

DetailsMotivation: Traditional changepoint detection requires manual feature extraction for penalty parameter selection, which is challenging and suboptimal.

Method: Using recurrent networks to automatically learn penalty parameters directly from raw data sequences, eliminating manual feature engineering.

Result: The proposed method generally outperforms traditional approaches in changepoint detection accuracy on 20 benchmark genomic datasets.

Conclusion: Recurrent networks provide an effective automated approach for learning optimal penalty parameters in changepoint detection, improving accuracy over traditional methods.

Abstract: Changepoint detection identifies significant shifts in data sequences, making it important in areas like finance, genetics, and healthcare. The Optimal Partitioning algorithms efficiently detect these changes, using a penalty parameter to limit the changepoints count. Determining the optimal value for this penalty can be challenging. Traditionally, this process involved manually extracting statistical features, such as sequence length or variance to make the prediction. This study proposes a novel approach that uses recurrent networks to learn this penalty directly from raw sequences by automatically extracting features. Experiments conducted on 20 benchmark genomic datasets show that this novel method generally outperforms traditional ones in changepoint detection accuracy.

[1013] In-Context Learning for Pure Exploration

Alessio Russo, Ryan Welch, Aldo Pacchiano

Main category: cs.LG

TL;DR: This paper introduces In-Context Pure Exploration (ICPE), which meta-trains Transformers to perform active sequential hypothesis testing by mapping observation histories to query actions and predicted hypotheses, enabling transfer learning without parameter updates.

DetailsMotivation: To address the problem of active sequential hypothesis testing (pure exploration) where learners adaptively collect data to determine underlying correct hypotheses, including tasks like best-arm identification in multi-armed bandits and generalized search problems.

Method: Meta-train Transformers to map observation histories to query actions and predicted hypotheses, creating an in-context learning approach that transfers across tasks without requiring parameter updates during inference.

Result: ICPE performs competitively with adaptive baselines across deterministic, stochastic, and structured benchmarks including best-arm identification and generalized search, without explicit modeling of information structure.

Conclusion: Transformers serve as practical architectures for general sequential testing tasks, demonstrating effective in-context learning capabilities for pure exploration problems.

Abstract: We study the problem active sequential hypothesis testing, also known as pure exploration: given a new task, the learner adaptively collects data from the environment to efficiently determine an underlying correct hypothesis. A classical instance of this problem is the task of identifying the best arm in a multi-armed bandit problem (a.k.a. BAI, Best-Arm Identification), where actions index hypotheses. Another important case is generalized search, a problem of determining the correct label through a sequence of strategically selected queries that indirectly reveal information about the label. In this work, we introduce In-Context Pure Exploration (ICPE), which meta-trains Transformers to map observation histories to query actions and a predicted hypothesis, yielding a model that transfers in-context. At inference time, ICPE actively gathers evidence on new tasks and infers the true hypothesis without parameter updates. Across deterministic, stochastic, and structured benchmarks, including BAI and generalized search, ICPE is competitive with adaptive baselines while requiring no explicit modeling of information structure. Our results support Transformers as practical architectures for general sequential testing.

[1014] Density Ratio-based Causal Discovery from Bivariate Continuous-Discrete Data

Takashi Nicholas Maeda, Shohei Shimizu, Hidetoshi Matsui

Main category: cs.LG

TL;DR: A novel causal discovery method for mixed bivariate data (continuous + discrete variables) that determines causal direction by analyzing monotonicity of conditional density ratios, avoiding strong distributional assumptions and information content bias.

DetailsMotivation: Existing methods for mixed bivariate causal discovery either impose strong distributional assumptions or struggle with fair comparison between variables of different types due to differences in their information content.

Method: The approach analyzes monotonicity of the conditional density ratio of the continuous variable conditioned on different values of the discrete variable. Theoretical analysis shows this ratio exhibits monotonicity when continuous causes discrete, but not in reverse.

Result: Experiments on synthetic and real-world datasets demonstrate superior accuracy compared to existing methods.

Conclusion: The method provides a principled basis for comparing causal directions between variables of different types, free from strong distributional assumptions and information content bias.

Abstract: We propose a causal discovery method for mixed bivariate data consisting of one continuous and one discrete variable. Existing approaches either impose strong distributional assumptions or face challenges in fairly comparing causal directions between variables of different types, due to differences in their information content. We introduce a novel approach that determines causal direction by analyzing the monotonicity of the conditional density ratio of the continuous variable, conditioned on different values of the discrete variable. Our theoretical analysis shows that the conditional density ratio exhibits monotonicity when the continuous variable causes the discrete variable, but not in the reverse direction. This property provides a principled basis for comparing causal directions between variables of different types, free from strong distributional assumptions and bias arising from differences in their information content. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets, showing superior accuracy compared to existing methods.

[1015] SALAD: Systematic Assessment of Machine Unlearning on LLM-Aided Hardware Design

Zeng Wang, Minghao Shao, Rupesh Karn, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel

Main category: cs.LG

TL;DR: SALAD uses machine unlearning to address data security risks in LLM-aided hardware design, enabling selective removal of contaminated benchmarks, sensitive IP, and malicious code from pre-trained models without full retraining.

DetailsMotivation: LLMs offer transformative capabilities for hardware design automation but pose significant data security challenges including Verilog evaluation data contamination, IP design leakage, and malicious Verilog generation risks.

Method: Introduces SALAD, a comprehensive assessment that leverages machine unlearning techniques to selectively remove contaminated benchmarks, sensitive IP/design artifacts, or malicious code patterns from pre-trained LLMs.

Result: Machine unlearning techniques effectively reduce data security risks in LLM-aided hardware design, as demonstrated through detailed case studies.

Conclusion: Machine unlearning provides an effective approach to mitigate data security threats in LLM-based hardware design automation without requiring full model retraining.

Abstract: Large Language Models (LLMs) offer transformative capabilities for hardware design automation, particularly in Verilog code generation. However, they also pose significant data security challenges, including Verilog evaluation data contamination, intellectual property (IP) design leakage, and the risk of malicious Verilog generation. We introduce SALAD, a comprehensive assessment that leverages machine unlearning to mitigate these threats. Our approach enables the selective removal of contaminated benchmarks, sensitive IP and design artifacts, or malicious code patterns from pre-trained LLMs, all without requiring full retraining. Through detailed case studies, we demonstrate how machine unlearning techniques effectively reduce data security risks in LLM-aided hardware design.

[1016] A Case for Library-Level k-Means Binning in Histogram Gradient-Boosted Trees

Asher Labovich

Main category: cs.LG

TL;DR: Replacing quantile binning with k-means discretizer in GBDTs improves predictive performance, especially for regression tasks with skewed data or limited bin budgets, while maintaining comparable performance in classification tasks.

DetailsMotivation: Quantile binning in GBDTs may miss critical boundary values that could enhance predictive performance, as it focuses on evenly distributing data points rather than capturing important split points.

Method: Proposes replacing quantile binning with k-means discretizer initialized with quantile bins, with theoretical justification showing k-means maximizes worst-case explained variance for L-Lipschitz functions.

Result: On 18 regression datasets: no significant losses, 55% MSE drop on one skewed dataset; on 15 classification datasets: statistically tied with quantile binning; synthetic experiments show 20-90% MSE gains with outliers or limited bins.

Conclusion: K-means binning is recommended as a “safe default” for GBDTs, especially in regression tasks and tight-budget GPU settings (32-64 bins), as it recovers key split points quantile overlooks with minimal one-off overhead.

Abstract: Modern Gradient Boosted Decision Trees (GBDTs) accelerate split finding with histogram-based binning, which reduces complexity from $O(N\log N)$ to $O(N)$ by aggregating gradients into fixed-size bins. However, the predominant quantile binning strategy - designed to distribute data points evenly among bins – may overlook critical boundary values that could enhance predictive performance. In this work, we consider a novel approach that replaces quantile binning with a $k$-means discretizer initialized with quantile bins, and justify the swap with a proof showing how, for any $L$-Lipschitz function, k-means maximizes the worst-case explained variance of Y obtained when treating all values in a given bin as equivalent. We test this swap against quantile and uniform binning on 33 OpenML datasets plus synthetics that control for modality, skew, and bin budget. Across 18 regression datasets, k-means shows no statistically significant losses at the 5% level and wins in three cases-most strikingly a 55% MSE drop on one particularly skewed dataset-even though k-means’ mean reciprocal rank (MRR) is slightly lower (0.65 vs 0.72). On the 15 classification datasets the two methods are statistically tied (MRR 0.70 vs 0.68) with gaps $\leq$0.2 pp. Synthetic experiments confirm consistently large MSE gains - typically >20% and rising to 90% as outlier magnitude increases or bin budget drops. We find that k-means keeps error on par with exhaustive (no-binning) splitting when extra cuts add little value, yet still recovers key split points that quantile overlooks. As such, we advocate for a built-in bin_method=k-means flag, especially in regression tasks and in tight-budget settings such as the 32-64-bin GPU regime - because it is a “safe default” with large upside, yet adds only a one-off, cacheable overhead ($\approx$ 3.5s per feature to bin 10M rows on one Apple M1 thread).

[1017] Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks

Ichiro Hashimoto

Main category: cs.LG

TL;DR: This paper analyzes benign overfitting in fixed-width leaky ReLU two-layer neural networks trained on mixture data via gradient descent, providing both upper and lower classification error bounds and discovering a phase transition based on signal strength.

DetailsMotivation: To understand when benign overfitting occurs in leaky ReLU neural networks and to relax the restrictive distributional assumptions (sub-Gaussian data, near orthogonality) made in previous work.

Method: Established directional convergence of network parameters and studied classification error bounds for the convergent direction, extending previous gradient flow results to gradient descent training.

Result: Discovered a phase transition in classification error bounds as a function of signal strength, and characterized cases where benign overfitting provably fails even with directional convergence.

Conclusion: The analysis enables studying benign overfitting in a much wider range of scenarios than previously possible, without requiring sub-Gaussian data or near orthogonality assumptions.

Abstract: In this paper, we study benign overfitting of fixed width leaky ReLU two-layer neural network classifiers trained on mixture data via gradient descent. We provide both, upper and lower classification error bounds, and discover a phase transition in the bound as a function of signal strength. The lower bound leads to a characterization of cases when benign overfitting provably fails even if directional convergence occurs. Our analysis allows us to considerably relax the distributional assumptions that are made in existing work on benign overfitting of leaky ReLU two-layer neural network classifiers. We can allow for non-sub-Gaussian data and do not require near orthogonality. Our results are derived by establishing directional convergence of the network parameters and studying classification error bounds for the convergent direction. Previously, directional convergence in (leaky) ReLU neural networks was established only for gradient flow. By first establishing directional convergence, we are able to study benign overfitting of fixed width leaky ReLU two-layer neural network classifiers in a much wider range of scenarios than was done before.

[1018] PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li

Main category: cs.LG

TL;DR: A novel wavelet-based approach for physiological signal analysis that captures multi-scale time-frequency features, with pretrained models for EMG/ECG and a unified multi-modal framework integrating EEG, addressing noise and variability challenges.

DetailsMotivation: Physiological signals are corrupted by motion artifacts, baseline drift, and low-SNR disturbances, with strong non-stationarity and abrupt changes that traditional methods struggle to represent effectively.

Method: Wavelet-based approach for multi-scale time-frequency feature extraction, pretrained models for EMG and ECG, and unified multi-modal framework with dedicated branches for each modality and learnable weighted fusion.

Result: Achieved superior performance and set new baselines in downstream tasks, effectively addressed low SNR, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks.

Conclusion: The wavelet-based architecture provides a solid foundation for diverse physiological signal analysis, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring and clinical diagnostics.

Abstract: Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications. Code and data are available at: github.com/ForeverBlue816/PhysioWave

[1019] Towards Coordinate- and Dimension-Agnostic Machine Learning for Partial Differential Equations

Trung V. Phan, George A. Kevrekidis, Soledad Villar, Yannis G. Kevrekidis, Juan M. Bello-Rivas

Main category: cs.LG

TL;DR: The paper proposes a coordinate-free and dimension-independent approach to learning partial differential equations using exterior calculus, enabling learned models to generalize across different spatial dimensions, coordinate systems, and geometries.

DetailsMotivation: Traditional machine learning methods for PDE identification are tied to specific spatial dimensions and coordinate systems, preventing learned evolution equations from generalizing to other spaces.

Method: Employ exterior calculus formalism to represent scalar field systems in a coordinate-free manner, using machine learning to predict field evolution that naturally generalizes across dimensions by construction.

Result: Demonstrated successful performance on FitzHugh-Nagumo, Barkley reaction-diffusion, and Patlak-Keller-Segel models, showing seamless transitions across spatial contexts with different dimensions, coordinate systems, boundary conditions, and curvatures.

Conclusion: The proposed ‘spatially liberated’ PDE learning approach enables field dynamics learned in one space to make accurate predictions in other spaces with varying spatial properties.

Abstract: The machine learning methods for data-driven identification of partial differential equations (PDEs) are typically defined for a given number of spatial dimensions and a choice of coordinates the data have been collected in. This dependence prevents the learned evolution equation from generalizing to other spaces. In this work, we reformulate the problem in terms of coordinate- and dimension-independent representations, paving the way toward what we call ``spatially liberated" PDE learning. To this end, we employ a machine learning approach to predict the evolution of scalar field systems expressed in the formalism of exterior calculus, which is coordinate-free and immediately generalizes to arbitrary dimensions by construction. We demonstrate the performance of this approach in the FitzHugh-Nagumo and Barkley reaction-diffusion models, as well as the Patlak-Keller-Segel model informed by in-situ chemotactic bacteria observations. We provide extensive numerical experiments that demonstrate that our approach allows for seamless transitions across various spatial contexts. We show that the field dynamics learned in one space can be used to make accurate predictions in other spaces with different dimensions, coordinate systems, boundary conditions, and curvatures.

[1020] A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao

Main category: cs.LG

TL;DR: This paper introduces data attribution for online RL, focusing on PPO algorithm. It establishes a local attribution framework to trace model behavior to training samples and proposes IIF algorithm for experience filtering, improving sample efficiency and training performance.

DetailsMotivation: Online RL suffers from sample inefficiency, training instability, and limited interpretability. Existing data attribution methods assume fixed datasets, which doesn't apply to online RL where experiences both update policy and shape future data collection.

Method: Established local attribution framework for PPO, measuring record contributions through gradient similarity between training loss and target functions (agent action and cumulative return). Proposed IIF algorithm that iteratively performs experience filtering to refine policy updates.

Result: IIF reduces sample complexity, speeds up training, and achieves higher returns across standard RL benchmarks (classic control, navigation, locomotion) and RLHF for large language models.

Conclusion: The framework opens a new direction for making online RL more interpretable, efficient, and effective through principled data attribution and experience filtering.

Abstract: Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples, yet existing methods assume fixed datasets, which is violated in online RL where each experience both updates the policy and shapes future data collection. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a \emph{local} attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record’s contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. Across standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Together, these results open a new direction for making online RL more interpretable, efficient, and effective.

[1021] Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards

Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Eddie Huang, Karthik Subbian, Linjun Zhang, Diyi Yang, James Zou, Jure Leskovec

Main category: cs.LG

TL;DR: Optimas is a unified framework for optimizing compound AI systems by maintaining Local Reward Functions (LRFs) per component that align with global system performance, enabling independent optimization of heterogeneous configurations while ensuring local improvements translate to global gains.

DetailsMotivation: Compound AI systems integrating multiple components face optimization challenges due to non-differentiable structures and diverse configuration types (prompts, hyperparameters, model parameters) across components.

Method: Maintains one Local Reward Function (LRF) per component with local-global alignment property. In each iteration, adapts LRFs to maintain alignment while maximizing each component’s local reward, enabling independent optimization of heterogeneous configurations.

Result: Extensive evaluations across five real-world compound systems show Optimas outperforms strong baselines by an average improvement of 11.92%.

Conclusion: Optimas provides a general and effective approach for improving compound systems by enabling component-level optimization while maintaining global performance alignment.

Abstract: Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component’s local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component’s local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at https://optimas.stanford.edu.

[1022] Rethinking Probabilistic Circuit Parameter Learning

Anji Liu, Zilei Shao, Guy Van den Broeck

Main category: cs.LG

TL;DR: Anemone is a new mini-batch EM algorithm for Probabilistic Circuits that addresses overfitting in existing methods by using implicit adaptive learning rates scaled by parameter contributions to batch likelihood.

DetailsMotivation: Existing mini-batch EM and gradient-based methods for Probabilistic Circuits underperform in final likelihood despite faster convergence, due to insufficient regularization of distribution changes that causes overfitting to current mini-batches.

Method: Anemone applies implicit adaptive learning rates to each parameter, scaled by how much it contributes to the likelihood of the current batch, addressing the overfitting problem identified in theoretical analysis.

Result: Across extensive experiments on language, image, and DNA datasets, anemone consistently outperforms existing optimizers in both convergence speed and final performance.

Conclusion: The proposed anemone algorithm successfully bridges the performance gap between full-batch and mini-batch methods for Probabilistic Circuits by properly regularizing distribution changes through adaptive learning rates.

Abstract: Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. Although empirical extensions to the mini-batch setting, as well as gradient-based mini-batch algorithms, converge faster than full-batch EM, they generally underperform in terms of final likelihood. We investigate this gap by establishing a novel theoretical connection between these practical algorithms and the general EM objective. Our analysis reveals a fundamental issue that existing mini-batch EM and gradient-based methods fail to properly regularize distribution changes, causing each update to effectively ``overfit’’ the current mini-batch. Motivated by this insight, we introduce anemone, a new mini-batch EM algorithm for PCs. Anemone applies an implicit adaptive learning rate to each parameter, scaled by how much it contributes to the likelihood of the current batch. Across extensive experiments on language, image, and DNA datasets, anemone consistently outperforms existing optimizers in both convergence speed and final performance.

[1023] TolerantECG: A Foundation Model for Imperfect Electrocardiogram

Huynh Dang Nguyen, Trong-Thang Pham, Ngan Le, Van Nguyen

Main category: cs.LG

TL;DR: TolerantECG is a foundation model for ECG signals that is robust to noise and can work with arbitrary subsets of the standard 12-lead ECG, combining contrastive and self-supervised learning.

DetailsMotivation: ECG effectiveness is compromised by noise or unavailability of leads in standard 12-lead recordings, leading to diagnostic errors or uncertainty.

Method: Combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations with their text report descriptions and corrupted/lead-missing signals.

Result: Consistently ranks as best or second-best performer across various ECG signal conditions and class levels in PTB-XL dataset, and achieves highest performance on MIT-BIH Arrhythmia Database.

Conclusion: TolerantECG effectively addresses challenges of noise and lead unavailability in ECG diagnostics through robust foundation model training.

Abstract: The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database.

[1024] Spurious Privacy Leakage in Neural Networks

Chenxiang Zhang, Jun Pang, Sjouke Mauw

Main category: cs.LG

TL;DR: Spurious correlation bias in neural networks leads to privacy disparities where spurious groups are more vulnerable to privacy attacks than non-spurious groups, and spurious robust methods fail to mitigate this privacy disparity.

DetailsMotivation: Neural networks trained on real-world data often exhibit biases and are vulnerable to privacy attacks, but the intersection of these two problems remains poorly understood. The paper aims to investigate the privacy impact of spurious correlation bias.

Method: The authors introduce the concept of ‘spurious privacy leakage’ and analyze privacy disparities between groups. They examine how spurious robust methods affect privacy and systematically compare privacy across different model architectures trained with spurious data.

Result: Privacy disparity between groups increases in tasks with simpler objectives due to spurious features. Counterintuitively, spurious robust methods designed to reduce bias fail to mitigate privacy disparity because they reduce reliance on spurious features for prediction but don’t prevent their memorization during training. Architectural choice can affect privacy evaluation.

Conclusion: Spurious correlation bias creates significant privacy disparities, and current robust methods are insufficient to address this issue since they don’t prevent memorization of spurious features during training. Model architecture choice impacts privacy evaluation in spurious data scenarios.

Abstract: Neural networks trained on real-world data often exhibit biases while simultaneously being vulnerable to privacy attacks aimed at extracting sensitive information. Despite extensive research on each problem individually, their intersection remains poorly understood. In this work, we investigate the privacy impact of spurious correlation bias. We introduce \emph{spurious privacy leakage}, a phenomenon in which spurious groups are significantly more vulnerable to privacy attacks than non-spurious groups. We observe that privacy disparity between groups increases in tasks with simpler objectives (e.g. fewer classes) due to spurious features. Counterintuitively, we demonstrate that spurious robust methods, designed to reduce spurious bias, fail to mitigate privacy disparity. Our analysis reveals that this occurs because robust methods can reduce reliance on spurious features for prediction, but do not prevent their memorization during training. Finally, we systematically compare the privacy of different model architectures trained with spurious data, demonstrating that, contrary to previous work, architectural choice can affect privacy evaluation.

[1025] Thought Purity: A Defense Framework For Chain-of-Thought Attack

Zihao Xue, Zhen Bi, Long Ma, Zhenlin Hu, Yan Wang, Zhenfang Liu, Qing Sheng, Jie Xiao, Jungang Lou

Main category: cs.LG

TL;DR: Thought Purity (TP) is a defense framework that protects reinforcement learning-trained Large Reasoning Models from Chain-of-Thought Attack vulnerabilities while maintaining reasoning performance.

DetailsMotivation: Reinforcement learning-trained Large Reasoning Models are vulnerable to security threats, particularly Chain-of-Thought Attack (CoTA) that exploits prompt controllability to degrade both reasoning safety and task performance.

Method: Three synergistic components: (1) safety-optimized data processing pipeline, (2) reinforcement learning-enhanced rule constraints, and (3) adaptive monitoring metrics.

Result: Establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems.

Conclusion: Significantly advances the security-functionality equilibrium for next-generation AI architectures by strengthening resistance to malicious content while preserving operational efficacy.

Abstract: While reinforcement learning-trained Large Reasoning Models (LRMs, e.g., Deepseek-R1) demonstrate advanced reasoning capabilities in the evolving Large Language Models (LLMs) domain, their susceptibility to security threats remains a critical vulnerability. This weakness is particularly evident in Chain-of-Thought (CoT) generation processes, where adversarial methods like backdoor prompt attacks can systematically subvert the model’s core reasoning mechanisms. The emerging Chain-of-Thought Attack (CoTA) reveals this vulnerability through exploiting prompt controllability, simultaneously degrading both CoT safety and task performance with low-cost interventions. To address this compounded security-performance vulnerability, we propose Thought Purity (TP): a defense framework that systematically strengthens resistance to malicious content while preserving operational efficacy. Our solution achieves this through three synergistic components: (1) a safety-optimized data processing pipeline (2) reinforcement learning-enhanced rule constraints (3) adaptive monitoring metrics. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems, significantly advancing the security-functionality equilibrium for next-generation AI architectures.

[1026] LUMION: Fast Fault Recovery for ML Jobs Using Programmable Optical Fabrics

Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Darius Bunandar, Rachee Singh

Main category: cs.LG

TL;DR: LUMION is a reconfigurable optical fabric that enables dynamic integration of spare accelerators into ongoing ML workloads when failures occur, eliminating the need for costly job migrations and improving resource efficiency in datacenters.

DetailsMotivation: Current approach of migrating entire ML jobs to new racks when accelerators fail is highly inefficient, requiring datacenters to reserve full racks of idle accelerators for fault tolerance.

Method: LUMION uses a novel reconfigurable optical fabric to connect accelerators within a datacenter rack, allowing dynamic integration of spare accelerators into ongoing workloads when failures occur.

Result: LUMION can swap a failed GPU with a healthy one and restart ML jobs within ~1 second, achieves higher inter-GPU bandwidth than traditional electrical racks, and provides nearly 2X improvement in fine-tuning throughput.

Conclusion: LUMION effectively addresses resource inefficiency in ML datacenters by enabling dynamic accelerator replacement without costly migrations, maintaining performance while improving fault tolerance.

Abstract: When accelerators fail in modern ML datacenters, operators migrate the affected ML training or inference jobs to entirely new racks. This approach, while preserving network performance, is highly inefficient, requiring datacenters to reserve full racks of idle accelerators for fault tolerance. In this paper, we address this resource inefficiency by introducing LUMION, a novel reconfigurable optical fabric for connecting accelerators within a datacenter rack. Instead of migrating entire ML jobs, LUMION dynamically integrates spare accelerators into ongoing workloads as failures occur, thereby maintaining consistent performance without costly migrations. We show the benefits of LUMION by building an end-to-end hardware prototype. Our experiments fine-tune Llama 3.2 and show that LUMION swaps a failed GPU with a healthy one and restarts the ML job within ~ 1 second of the failure. LUMION achieves higher inter-GPU bandwidth compared to traditional electrical racks after replacing failed accelerators with spare ones, leading to nearly 2X improvement in fine-tuning throughput.

[1027] Solar Photovoltaic Assessment with Large Language Model

Muhao Guo, Yang Weng

Main category: cs.LG

TL;DR: PVAL framework uses LLMs with task decomposition, output standardization, few-shot prompting, and fine-tuning to improve solar panel detection in satellite imagery, addressing transparency and generalization issues of existing methods.

DetailsMotivation: Existing solar PV detection methods lack transparency, require large training datasets, and struggle with generalization to new regions and conditions, hindering large-scale deployment for grid optimization.

Method: Proposed PVAL framework incorporates task decomposition for efficient workflows, output standardization for consistency, few-shot prompting for better classification, and fine-tuning with curated PV datasets.

Result: PVAL enables automated, reproducible solar panel detection with improved accuracy, transparency, scalability, and adaptability across heterogeneous datasets while minimizing computational overhead.

Conclusion: PVAL establishes a robust pipeline for solar panel detection that supports large-scale renewable energy integration and optimized grid management through open-source accessibility and transparent methodologies.

Abstract: Accurate detection and localization of solar photovoltaic (PV) panels in satellite imagery is essential for optimizing microgrids and active distribution networks (ADNs), which are critical components of renewable energy systems. Existing methods lack transparency regarding their underlying algorithms or training datasets, rely on large, high-quality PV training data, and struggle to generalize to new geographic regions or varied environmental conditions without extensive re-training. These limitations lead to inconsistent detection outcomes, hindering large-scale deployment and data-driven grid optimization. In this paper, we investigate how large language models (LLMs) can be leveraged to overcome these challenges. Despite their promise, LLMs face several challenges in solar panel detection, including difficulties with multi-step logical processes, inconsistent output formatting, frequent misclassification of visually similar objects (e.g., shadows, parking lots), and low accuracy in complex tasks such as spatial localization and quantification. To overcome these issues, we propose the PV Assessment with LLMs (PVAL) framework, which incorporates task decomposition for more efficient workflows, output standardization for consistent and scalable formatting, few-shot prompting to enhance classification accuracy, and fine-tuning using curated PV datasets with detailed annotations. PVAL ensures transparency, scalability, and adaptability across heterogeneous datasets while minimizing computational overhead. By combining open-source accessibility with robust methodologies, PVAL establishes an automated and reproducible pipeline for solar panel detection, paving the way for large-scale renewable energy integration and optimized grid management.

[1028] Inference-time Scaling of Diffusion Models through Classical Search

Xiangcheng Zhang, Haowei Lin, Haotian Ye, James Zou, Jianzhu Ma, Yitao Liang, Yilun Du

Main category: cs.LG

TL;DR: A framework using classical search algorithms for inference-time control in diffusion models, combining local search via annealed Langevin MCMC with global exploration using breadth-first and depth-first tree search.

DetailsMotivation: To address the challenge of inference-time control in diffusion models by adapting generated outputs to meet diverse test-time objectives using principles from classical search algorithms.

Method: Proposes a general framework that orchestrates local and global search: local search via annealed Langevin MCMC and global exploration using breadth-first and depth-first tree search.

Result: Significant gains in both performance and efficiency across challenging domains including planning, offline reinforcement learning, and image generation.

Conclusion: Classical search provides a principled and practical foundation for inference-time scaling in diffusion models.

Abstract: Classical search algorithms have long underpinned modern artificial intelligence. In this work, we tackle the challenge of inference-time control in diffusion models – adapting generated outputs to meet diverse test-time objectives – using principles from classical search. We propose a general framework that orchestrates local and global search to efficiently navigate the generative space. It employs a theoretically grounded local search via annealed Langevin MCMC and performs compute-efficient global exploration using breadth-first and depth-first tree search. We evaluate our approach on a range of challenging domains, including planning, offline reinforcement learning, and image generation. Across all tasks, we observe significant gains in both performance and efficiency. These results show that classical search provides a principled and practical foundation for inference-time scaling in diffusion models. Project page at https://diffusion-inference-scaling.github.io/.

[1029] First Hallucination Tokens Are Different from Conditional Ones

Jakob Snel, Seong Joon Oh

Main category: cs.LG

TL;DR: The paper finds that the first hallucinated token in LLM outputs is significantly more detectable than subsequent hallucinated tokens, revealing a structural pattern in token-level hallucination detection.

DetailsMotivation: To understand how hallucination signals are distributed across sequences of hallucinated tokens in LLMs, enabling more fine-grained detection and intervention.

Method: Leveraged token-level annotations from the RAGTruth corpus to analyze hallucination detection patterns across different models.

Result: Discovered that first hallucinated tokens are far more detectable than later ones, and this structural property holds consistently across different LLM models.

Conclusion: First hallucination tokens play a crucial role in token-level hallucination detection, providing insights for developing more effective detection methods.

Abstract: Large Language Models (LLMs) hallucinate, and detecting these cases is key to ensuring trust. While many approaches address hallucination detection at the response or span level, recent work explores token-level detection, enabling more fine-grained intervention. However, the distribution of hallucination signal across sequences of hallucinated tokens remains unexplored. We leverage token-level annotations from the RAGTruth corpus and find that the first hallucinated token is far more detectable than later ones. This structural property holds across models, suggesting that first hallucination tokens play a key role in token-level hallucination detection. Our code is available at https://github.com/jakobsnl/RAGTruth_Xtended.

[1030] Cascading Adversarial Bias from Injection to Distillation in Language Models

Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, Alina Oprea

Main category: cs.LG

TL;DR: Distilled language models are vulnerable to adversarial bias injection during training, where minimal data poisoning of teacher models propagates and amplifies in student models, with current defenses proving ineffective.

DetailsMotivation: To investigate security vulnerabilities in model distillation where adversaries can inject subtle biases into teacher models that propagate and amplify in student models, raising concerns about resilience to adversarial manipulation in widely deployed systems.

Method: Proposed two propagation modes: Untargeted Propagation (affecting multiple tasks) and Targeted Propagation (focusing on specific tasks while maintaining normal behavior elsewhere). Evaluated across six bias types, various distillation methods, and different modalities including text and code generation.

Result: With only 25 poisoned samples (0.25% poisoning rate), student models generated biased responses 76.9% of the time in targeted scenarios (higher than 69.4% in teacher models). For untargeted propagation, adversarial bias appeared 6x-29x more frequently in student models on unseen tasks. Current defenses (perplexity filtering, bias detection systems, LLM-based autorater frameworks) were ineffective.

Conclusion: Results expose significant security vulnerabilities in distilled models, highlighting the need for specialized safeguards and proposing practical design principles for building effective adversarial bias mitigation strategies.

Abstract: Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.

[1031] Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization

Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi

Main category: cs.LG

TL;DR: A magnitude-based synaptic pruning method that progressively removes low-importance connections during training, replacing dropout regularization and achieving better performance in time series forecasting.

DetailsMotivation: To create a more biologically-inspired regularization method that mimics synaptic pruning in biological brains, rather than using random dropout which doesn't consider activity-dependent pruning.

Method: Computes weight importance from absolute magnitudes across layers, applies a cubic schedule to gradually increase global sparsity, and permanently removes low-importance weights at fixed intervals while maintaining gradient flow for active weights.

Result: Ranked best overall in experiments across multiple time series forecasting models (RNN, LSTM, Patch Time Series Transformer) on four datasets, with statistically significant improvements (p < 0.01). Reduced Mean Absolute Error by up to 20% over models with no or standard dropout, and up to 52% in select transformer models for financial forecasting.

Conclusion: The dynamic pruning mechanism advances regularization by coupling weight elimination with progressive sparsification, offering easy integration into diverse architectures and serving as a practical alternative to conventional dropout techniques, especially effective in financial time series forecasting.

Abstract: Synaptic pruning in biological brains removes weak connections to improve efficiency. In contrast, dropout regularization in artificial neural networks randomly deactivates neurons without considering activity-dependent pruning. We propose a magnitude-based synaptic pruning method that better reflects biology by progressively removing low-importance connections during training. Integrated directly into the training loop as a dropout replacement, our approach computes weight importance from absolute magnitudes across layers and applies a cubic schedule to gradually increase global sparsity. At fixed intervals, pruning masks permanently remove low-importance weights while maintaining gradient flow for active ones, eliminating the need for separate pruning and fine-tuning phases. Experiments on multiple time series forecasting models including RNN, LSTM, and Patch Time Series Transformer across four datasets show consistent gains. Our method ranked best overall, with statistically significant improvements confirmed by Friedman tests (p < 0.01). In financial forecasting, it reduced Mean Absolute Error by up to 20% over models with no or standard dropout, and up to 52% in select transformer models. This dynamic pruning mechanism advances regularization by coupling weight elimination with progressive sparsification, offering easy integration into diverse architectures. Its strong performance, especially in financial time series forecasting, highlights its potential as a practical alternative to conventional dropout techniques.

[1032] Learning Semantics, Not Addresses: Runtime Neural Prefetching for Far Memory

Yutong Huang, Zhiyuan Guo, Yiying Zhang

Main category: cs.LG

TL;DR: FarSight is a Linux-based far-memory system that uses deep learning for memory prefetching by separating application semantics from runtime memory layout, achieving up to 3.6x performance improvement over state-of-the-art methods.

DetailsMotivation: Memory prefetching is crucial for far-memory systems where large memory portions are offloaded to remote tiers, but existing ML approaches have been limited to simulation or small-scale hardware.

Method: Decouples application semantics from runtime memory layout, enabling offline-trained deep learning models to predict access patterns over a compact ordinal vocabulary, with lightweight mappings resolving predictions at runtime.

Result: Across four data-intensive workloads, FarSight delivers up to 3.6x higher performance than state-of-the-art approaches.

Conclusion: FarSight successfully demonstrates the practical application of deep learning for memory prefetching in far-memory systems through semantic-layout decoupling, achieving significant performance gains.

Abstract: Memory prefetching has long boosted CPU caches and is increasingly vital for far-memory systems, where large portions of memory are offloaded to cheaper, remote tiers. While effective prefetching requires accurate prediction of future accesses, prior ML approaches have been limited to simulation or small-scale hardware. We introduce FarSight, the first Linux-based far-memory system to leverage deep learning by decoupling application semantics from runtime memory layout. This separation enables offline-trained models to predict access patterns over a compact ordinal vocabulary, which are resolved at runtime through lightweight mappings. Across four data-intensive workloads, FarSight delivers up to 3.6x higher performance than the state-of-the-art.

[1033] Reliably Detecting Model Failures in Deployment Without Labels

Viet Nguyen, Changjian Shui, Vijay Giri, Siddarth Arya, Michael Cooper, Amol Verma, Fahad Razak, Rahul G. Krishnan

Main category: cs.LG

TL;DR: D3M is a monitoring algorithm that detects when machine learning models need retraining due to data distribution shifts, using model disagreement to distinguish between performance-degrading and non-degrading shifts.

DetailsMotivation: Models in dynamic environments face changing data distributions, but current approaches lack reliable methods to detect when shifts actually degrade performance without access to labels during deployment.

Method: Proposes D3M algorithm based on predictive model disagreement, which monitors post-deployment deterioration by analyzing how models disagree on predictions when data distributions shift.

Result: D3M achieves low false positive rates for non-deteriorating shifts and provides theoretical sample complexity bounds for high true positive rates under deteriorating shifts. Empirical validation on benchmarks and real-world medical data shows effectiveness.

Conclusion: D3M provides a viable alert mechanism for high-stakes ML pipelines, enabling practical detection of performance-degrading data shifts without requiring labels during deployment.

Abstract: The distribution of data changes over time; models operating operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge since some, but not all shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, achieving low false positive rates under non-deteriorating shifts and provides sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmark and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.

[1034] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang

Main category: cs.LG

TL;DR: RLVR faces depth and breadth limitations. DARS addresses depth neglect by re-weighting hard problems through multi-stage rollouts, while large-breadth training enhances Pass@1 by scaling batch size and using full-batch updates. DARS-B combines both approaches for simultaneous gains.

DetailsMotivation: Current RLVR methods have unexplored potential in depth (hardest problems models can solve) and breadth (training data volume per iteration), with GRPO algorithm showing systematic bias against low-accuracy instances crucial for reasoning advancement.

Method: 1) DARS: Difficulty Adaptive Rollout Sampling that re-weights hard problems through targeted multi-stage rollouts. 2) Large-breadth training: Scaling batch size and replacing PPO’s mini-batch iterations with full-batch updates over multiple epochs. 3) DARS-B: Combining DARS with large-breadth training.

Result: DARS delivers consistent Pass@K gains without extra inference cost. Large-breadth training significantly enhances Pass@1 performance and sustains high token-level entropy. DARS-B shows simultaneous gains in both Pass@K and Pass@1.

Conclusion: Breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, and their combination is key to unleashing the full reasoning power of RLVR.

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest problem a model can sample; Breadth-the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO’s mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.

[1035] Measurement-Aligned Sampling for Inverse Problem

Shaorong Zhang, Rob Brekelmans, Yunshu Wu, Greg Ver Steeg

Main category: cs.LG

TL;DR: Proposes Measurement-Aligned Sampling (MAS), a novel framework for linear inverse problems that better balances prior information from diffusion models with measurement consistency, handling both Gaussian and non-Gaussian noise.

DetailsMotivation: Existing diffusion-based inverse problem methods struggle with conflicting signals between prior and measurement information, especially with non-Gaussian or unknown noise, leading to poor measurement consistency.

Method: MAS framework that unifies and extends approaches like DDNM and TMPD, flexibly balancing prior and measurement information while generalizing to handle various noise types including known Gaussian and unknown/non-Gaussian noise.

Result: Extensive experiments show MAS consistently outperforms state-of-the-art methods across various tasks while maintaining relatively low computational cost.

Conclusion: MAS provides an effective solution for linear inverse problems that better handles noise uncertainty and measurement consistency compared to existing diffusion-based approaches.

Abstract: Diffusion models provide a powerful way to incorporate complex prior information for solving inverse problems. However, existing methods struggle to correctly incorporate guidance from conflicting signals in the prior and measurement, and often failed to maximizing the consistency to the measurement, especially in the challenging setting of non-Gaussian or unknown noise. To address these issues, we propose Measurement-Aligned Sampling (MAS), a novel framework for linear inverse problem solving that flexibly balances prior and measurement information. MAS unifies and extends existing approaches such as DDNM, TMPD, while generalizing to handle both known Gaussian noise and unknown or non-Gaussian noise types. Extensive experiments demonstrate that MAS consistently outperforms state-of-the-art methods across a variety of tasks, while maintaining relatively low computational cost.

[1036] On Zero-Shot Reinforcement Learning

Scott Jeen

Main category: cs.LG

TL;DR: This thesis addresses zero-shot reinforcement learning challenges in real-world settings where data simulation is expensive, focusing on three constraints: data quality, observability, and data availability.

DetailsMotivation: RL systems excel in simulated environments but struggle in real-world deployment due to simulator-reality misalignment. Zero-shot RL aims to generalize to new tasks without practice, but existing methods fail under real-world constraints.

Method: Proposes a suite of methods for zero-shot RL that address three key constraints: handling small/homogeneous datasets (data quality), dealing with partial observability (observability), and operating without prior data access (data availability).

Result: Empirical studies demonstrate the limitations of existing methods and validate the proposed techniques for overcoming real-world constraints in zero-shot RL.

Conclusion: The proposed methods advance zero-shot RL towards practical real-world deployment by addressing critical constraints that existing approaches fail to handle effectively.

Abstract: Modern reinforcement learning (RL) systems capture deep truths about general, human problem-solving. In domains where new data can be simulated cheaply, these systems uncover sequential decision-making policies that far exceed the ability of any human. Society faces many problems whose solutions require this skill, but they are often in domains where new data cannot be cheaply simulated. In such scenarios, we can learn simulators from existing data, but these will only ever be approximately correct, and can be pathologically incorrect when queried outside of their training distribution. As a result, a misalignment between the environments in which we train our agents and the real-world in which we wish to deploy our agents is inevitable. Dealing with this misalignment is the primary concern of zero-shot reinforcement learning, a problem setting where the agent must generalise to a new task or domain with zero practice shots. Whilst impressive progress has been made on methods that perform zero-shot RL in idealised settings, new work is needed if these results are to be replicated in real-world settings. In this thesis, we argue that doing so requires us to navigate (at least) three constraints. First, the data quality constraint: real-world datasets are small and homogeneous. Second, the observability constraint: states, dynamics and rewards in the real-world are often only partially observed. And third, the data availability constraint: a priori access to data cannot always be assumed. This work proposes a suite of methods that perform zero-shot RL subject to these constraints. In a series of empirical studies we expose the failings of existing methods, and justify our techniques for remedying them. We believe these designs take us a step closer to RL methods that can be deployed to solve real-world problems.

[1037] Curating art exhibitions using machine learning

Eurico Covas

Main category: cs.LG

TL;DR: Four AI models were developed to learn from human-curated exhibitions, with three successfully imitating curators at varying levels of precision and coherence, showing that modest models can approach GPT-level performance through careful design.

DetailsMotivation: To explore whether AI can learn curatorial expertise from existing human-curated exhibitions and replicate similar curation work.

Method: Developed four machine learning models that learn from expert-curated exhibitions, using feature engineering and carefully designed modest-sized architectures.

Result: Three out of four models achieved reasonable ability to imitate human curators with various degrees of precision and curatorial coherence, performing well above random chance.

Conclusion: Two key insights: sufficient information exists in exhibitions to build AI models that replicate past exhibitions accurately, and carefully designed modest models can approach the performance of large language models like GPT.

Abstract: Here we present a series of artificial models - a total of four related models - based on machine learning techniques that attempt to learn from existing exhibitions which have been curated by human experts, in order to be able to do similar curatorship work. Out of our four artificial intelligence models, three achieve a reasonable ability at imitating these various curators responsible for all those exhibitions, with various degrees of precision and curatorial coherence. In particular, we can conclude two key insights: first, that there is sufficient information in these exhibitions to construct an artificial intelligence model that replicates past exhibitions with an accuracy well above random choices; and second, that using feature engineering and carefully designing the architecture of modest size models can make them almost as good as those using the so-called large language models such as GPT in a brute force approach.

[1038] Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Ziyue Li, Chenrui Fan, Tianyi Zhou

Main category: cs.LG

TL;DR: This paper presents the first study of grokking in practical LLM pretraining, showing that grokking emerges in mixture-of-experts (MoE) LLMs during one-epoch pretraining on large-scale corpora, with different data groups entering grokking stages asynchronously.

DetailsMotivation: To investigate when LLMs memorize training data vs. generalize on downstream tasks, and what happens when there's a lag between the two, in practical LLM pretraining settings rather than limited algorithmic tasks.

Method: Analyzed pathway dynamics in MoE LLMs during one-epoch pretraining on cross-domain corpora, developing two novel metrics to quantify pathway similarity between samples and expert consistency across layers.

Result: Found that pathways evolve from random and instance-specific to structured and transferable despite converged pretraining loss, indicating transition from memorization to generalization. The metrics faithfully track downstream generalization without costly evaluation.

Conclusion: Grokking occurs in practical LLM pretraining, with pathway dynamics providing mechanistic interpretation of local grokking, and the proposed metrics offer zero-cost monitoring of generalization on downstream tasks.

Abstract: This paper presents the first study of grokking in practical LLM pretraining. Specifically, we investigate when an LLM memorizes the training data, when its generalization on downstream tasks starts to improve, and what happens if there is a lag between the two. Unlike existing works studying when a small model generalizes to limited and specified tasks during thousands epochs’ training on algorithmic data, we focus on a practical setting for LLMs, i.e., one-epoch pretraining of next-token prediction on a cross-domain, large-scale corpus, and generalization on diverse benchmark tasks covering math/commonsense reasoning, code generation, and domain-specific retrieval. Our study, for the first time, verifies that grokking still emerges in pretraining mixture-of-experts (MoE) LLMs, though different local data groups may enter their grokking stages asynchronously due to the heterogeneity of their distributions and attributions to others. To find a mechanistic interpretation of this local grokking, we investigate the dynamics of training data’s pathways (i.e., expert choices across layers in MoE). Our primary discovery is that the pathways evolve from random, non-smooth across layers, instance-specific to more structured and transferable across samples, despite the converged pretraining loss. This depicts a transition from memorization to generalization. Two novel metrics are developed to quantify these patterns: one computes the pathway similarity between samples, while the other measures the consistency of aggregated experts between subsequent layers for each sample. These training data based metrics induce zero cost but can faithfully track and monitor the generalization of LLMs on downstream tasks, which, in conventional settings, requires costly instruction tuning and benchmark evaluation.

[1039] Sampling-aware Adversarial Attacks Against Large Language Models

Tim Beyer, Yan Scholten, Leo Schwinn, Stephan Günnemann

Main category: cs.LG

TL;DR: This paper shows that incorporating repeated sampling during adversarial attacks significantly improves success rates and efficiency for eliciting harmful responses from LLMs, revealing that many existing optimization strategies have limited effect on output harmfulness.

DetailsMotivation: Existing adversarial attacks overlook the stochastic nature of LLMs and overestimate robustness by targeting single-point greedy generations, failing to accurately assess LLM safety at scale.

Method: The authors cast attacks as a resource allocation problem between optimization and sampling, determine compute-optimal trade-offs, integrate sampling into existing attacks, and introduce a label-free objective based on entropy maximization.

Result: Integrating sampling boosts attack success rates by up to 37% and improves efficiency by up to two orders of magnitude. Analysis shows many common optimization strategies have little effect on output harmfulness.

Conclusion: Sampling is crucial in attacks to accurately assess and strengthen LLM safety at scale, and the sampling-aware perspective enables new optimization targets.

Abstract: To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point greedy generations, overlooking the inherently stochastic nature of LLMs and overestimating robustness. We show that for the goal of eliciting harmful responses, repeated sampling of model outputs during the attack complements prompt optimization and serves as a strong and efficient attack vector. By casting attacks as a resource allocation problem between optimization and sampling, we determine compute-optimal trade-offs and show that integrating sampling into existing attacks boosts success rates by up to 37% and improves efficiency by up to two orders of magnitude. We further analyze how distributions of output harmfulness evolve during an adversarial attack, discovering that many common optimization strategies have little effect on output harmfulness. Finally, we introduce a label-free proof-of-concept objective based on entropy maximization, demonstrating how our sampling-aware perspective enables new optimization targets. Overall, our findings establish the importance of sampling in attacks to accurately assess and strengthen LLM safety at scale.

[1040] Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling

Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon

Main category: cs.LG

TL;DR: PG-DLM introduces particle Gibbs sampling for diffusion language models, enabling trajectory-level refinement during inference to optimize rewards while maintaining generation quality, outperforming prior methods across various compute budgets.

DetailsMotivation: Discrete diffusion models match autoregressive models' performance but lack inference-time control methods. Existing approaches only optimize rewards step-by-step without trajectory-level refinement.

Method: PG-DLM constructs a Markov chain over full denoising trajectories and applies conditional sequential Monte Carlo kernel to resample them, providing theoretical convergence guarantees.

Result: Empirical evaluation shows PG-DLM consistently outperforms prior methods using MDLM and LLaDA-8B models across toxicity control, sentiment control, and linguistic acceptability tasks under various compute budgets.

Conclusion: Scaling iterations achieves the best reward-perplexity trade-off, and PG-DLM enables effective inference-time control without model retraining while preserving generation quality.

Abstract: Discrete diffusion models have recently emerged as strong alternatives to autoregressive language models, matching their performance through large-scale training. However, inference-time control remains relatively underexplored. In this work, we study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity under reward optimization. PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them. We derive theoretical guarantees for convergence, including asymptotic consistency and variance bounds. Within this framework, we further analyze trade-offs across four key axes for inference-time scaling under fixed budgets: iterations, samples, denoising steps, and reward estimation. Our analysis shows scaling iterations achieves the best reward-perplexity trade-off. Empirically, PG-DLM consistently outperforms prior methods using MDLM and LLaDA-8B as base models across a wide range of compute budgets for reward-guided generation tasks including toxicity and sentiment control as well as linguistic acceptability.

[1041] Attention as an Adaptive Filter

Peter Racioppo

Main category: cs.LG

TL;DR: AFA is a novel attention mechanism that models input sequences as observations of a linear SDE, using closed-form solutions to efficiently compute attention weights that correspond to maximum likelihood solutions.

DetailsMotivation: To develop a more principled attention mechanism that incorporates temporal dynamics modeling directly into attention computation, moving beyond simple query-key comparisons.

Method: Model input sequences as discrete observations of a linear stochastic differential equation with simultaneously diagonalizable state matrices and noise covariances, using closed-form solutions to the differential Lyapunov equation to propagate pairwise uncertainties.

Result: Derives attention as the maximum likelihood solution for the linear SDE, with attention weights corresponding to robust residual-based reweightings of propagated pairwise precisions. Achieves same computational complexity as standard attention.

Conclusion: AFA provides a principled alternative to standard attention that incorporates dynamics modeling, with ordinary dot-product attention emerging as a special case under limiting conditions.

Abstract: We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. Rather than comparing queries and keys directly, we model the input sequence as discrete observations of a linear stochastic differential equation (SDE). By imposing a linear dynamics model with simultaneously diagonalizable state matrices and noise covariances, we can make use of a closed-form solution to the differential Lyapunov equation to efficiently propagate pairwise uncertainties through the dynamics. Attention naturally arises as the maximum likelihood solution for this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated pairwise precisions. Imposing an additional constraint on the state matrix’s eigenvalues leads to a simplified variant with the same computational and memory complexity as standard attention. In the limit of vanishing dynamics and process noise, and using a small-angle approximation, we recover ordinary dot-product attention.

[1042] VITA: Variational Pretraining of Transformers for Climate-Robust Crop Yield Forecasting

Adib Hasan, Mardavij Roozbehani, Munther Dahleh

Main category: cs.LG

TL;DR: VITA is a variational pretraining framework that uses satellite weather data to improve crop yield forecasting, especially during extreme years, achieving state-of-the-art performance with less computational resources.

DetailsMotivation: Current AI models underperform when crop yields deviate from historical trends due to lack of physically grounded datasets linking atmospheric states to yields.

Method: VITA uses variational inference transformer with satellite-based weather data pretraining and transfers to ground-based measurements. It learns to predict latent atmospheric states under seasonality-aware sinusoidal prior and fine-tunes with limited weather statistics.

Result: Applied to 763 counties in U.S. Corn Belt, VITA achieves state-of-the-art performance in predicting corn and soybean yields across all scenarios, particularly during extreme years, with statistically significant improvements (p < 0.0001).

Conclusion: Domain-aware AI design like VITA can overcome data limitations and support resilient agricultural forecasting in changing climate, outperforming prior frameworks with less compute.

Abstract: Accurate crop yield forecasting is essential for global food security. However, current AI models systematically underperform when yields deviate from historical trends. We attribute this to the lack of rich, physically grounded datasets directly linking atmospheric states to yields. To address this, we introduce VITA (Variational Inference Transformer for Asymmetric data), a variational pretraining framework that learns representations from large satellite-based weather datasets and transfers to the ground-based limited measurements available for yield prediction. VITA is trained using detailed meteorological variables as proxy targets during pretraining and learns to predict latent atmospheric states under a seasonality-aware sinusoidal prior. This allows the model to be fine-tuned using limited weather statistics during deployment. Applied to 763 counties in the U.S. Corn Belt, VITA achieves state-of-the-art performance in predicting corn and soybean yields across all evaluation scenarios, particularly during extreme years, with statistically significant improvements (paired t-test, $p < 0.0001$). Importantly, VITA outperforms prior frameworks like GNN-RNN without soil data, and bigger foundational models (e.g., Chronos-Bolt) with less compute, making it practical for real-world use–especially in data-scarce regions. This work highlights how domain-aware AI design can overcome data limitations and support resilient agricultural forecasting in a changing climate.

[1043] Owen Sampling Accelerates Contribution Estimation in Federated Learning

Hossein KhademSohi, Hadi Hemmati, Jiayu Zhou, Steve Drew

Main category: cs.LG

TL;DR: FedOwen is an efficient framework that uses Owen sampling to approximate Shapley values for client contribution in federated learning, achieving higher accuracy with adaptive client selection.

DetailsMotivation: Accurately estimating client contributions in federated learning is crucial for fair rewards and faster convergence, but exact Shapley value computation is infeasible for large federations due to exponential scaling.

Method: Uses Owen sampling to approximate Shapley values efficiently under fixed evaluation budget, combined with adaptive client selection that balances exploitation of high-value clients and exploration of under-sampled ones.

Result: FedOwen achieves up to 23% higher final accuracy within the same number of communication rounds compared to state-of-the-art baselines on non-IID benchmarks under fixed valuation cost.

Conclusion: FedOwen provides an efficient and effective approach for client valuation in federated learning, improving model accuracy while maintaining computational feasibility.

Abstract: Federated Learning (FL) aggregates information from multiple clients to train a shared global model without exposing raw data. Accurately estimating each client’s contribution is essential not just for fair rewards, but for selecting the most useful clients so the global model converges faster. The Shapley value is a principled choice, yet exact computation scales exponentially with the number of clients, making it infeasible for large federations. We propose FedOwen, an efficient framework that uses Owen sampling to approximate Shapley values under the same total evaluation budget as existing methods while keeping the approximation error small. In addition, FedOwen uses an adaptive client selection strategy that balances exploiting high-value clients with exploring under-sampled ones, reducing bias and uncovering rare but informative data. Under a fixed valuation cost, FedOwen achieves up to 23 percent higher final accuracy within the same number of communication rounds compared to state-of-the-art baselines on non-IID benchmarks.

[1044] time2time: Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models

Debdeep Sanyal, Aaryan Nagpal, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

Main category: cs.LG

TL;DR: The paper introduces activation transplantation to investigate if transformer models internalize semantic concepts like market regimes and can simulate rare events like market crashes by manipulating hidden states.

DetailsMotivation: To determine if foundation models internalize semantic concepts beyond curve fitting and if their representations can simulate high-stakes events like market crashes.

Method: Activation transplantation - a causal intervention that manipulates hidden states by imposing statistical moments from one event onto another during forward pass.

Result: Models encode graded event severity with latent vector norm correlating with systemic shock magnitude. Validated across Toto and Chronos architectures.

Conclusion: Large time series transformers have steerable, semantically grounded representations with a latent concept space governing predictions, enabling semantic what-if analysis for stress-testing.

Abstract: While transformer-based foundation models excel at forecasting routine patterns, two questions remain: do they internalize semantic concepts such as market regimes, or merely fit curves? And can their internal representations be leveraged to simulate rare, high-stakes events such as market crashes? To investigate this, we introduce activation transplantation, a causal intervention that manipulates hidden states by imposing the statistical moments of one event (e.g., a historical crash) onto another (e.g., a calm period) during the forward pass. This procedure deterministically steers forecasts: injecting crash semantics induces downturn predictions, while injecting calm semantics suppresses crashes and restores stability. Beyond binary control, we find that models encode a graded notion of event severity, with the latent vector norm directly correlating with the magnitude of systemic shocks. Validated across two architecturally distinct TSFMs, Toto (decoder only) and Chronos (encoder-decoder), our results demonstrate that steerable, semantically grounded representations are a robust property of large time series transformers. Our findings provide evidence for a latent concept space that governs model predictions, shifting interpretability from post-hoc attribution to direct causal intervention, and enabling semantic “what-if” analysis for strategic stress-testing.

[1045] Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning

Vittorio Giammarino, Ruiqi Ni, Ahmed H. Qureshi

Main category: cs.LG

TL;DR: Proposes a physics-informed regularizer for offline goal-conditioned RL that improves value function learning by incorporating geometric inductive bias from the Eikonal PDE.

DetailsMotivation: Offline GCRL faces challenges with limited dataset coverage and long-horizon generalization. Current methods lack geometric structure in value functions.

Method: Physics-informed regularizer derived from Eikonal PDE, integrated with Hierarchical Implicit Q-Learning (HIQL) as Eik-HIQL.

Result: Significant improvements in performance and generalization, especially in stitching regimes and large-scale navigation tasks.

Conclusion: The Eikonal-regularized approach effectively incorporates geometric structure into value learning, enhancing offline GCRL capabilities.

Abstract: Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice due to the need to learn from datasets with limited coverage of the state-action space and to generalize across long-horizon tasks. To improve on these challenges, we propose a \emph{Physics-informed (Pi)} regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE) and which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Eikonal-regularized HIQL (Eik-HIQL), yields significant improvements in both performance and generalization, with pronounced gains in stitching regimes and large-scale navigation tasks.

[1046] FRAUDGUESS: Spotting and Explaining New Types of Fraud in Million-Scale Financial Data

Robson L. F. Cordeiro, Meng-Chieh Lee, Christos Faloutsos

Main category: cs.LG

TL;DR: FRAUDGUESS is a system for detecting new types of financial fraud through micro-cluster analysis and providing justification via visualization tools, successfully identifying previously unknown fraudulent behaviors in real-world financial data.

DetailsMotivation: To address the challenge of detecting new types of financial fraud that are unknown to domain experts, while also providing evidence to support the findings.

Method: Uses micro-cluster analysis in a carefully designed feature space for detection, and visualization tools, heatmaps, and interactive dashboards for justification and deep dives.

Result: Successfully deployed in real life and identified three new fraudulent behaviors in a million-scale financial dataset, with two behaviors confirmed as fraudulent by experts, catching hundreds of previously undetected fraudulent transactions.

Conclusion: FRAUDGUESS effectively detects new fraud types and provides convincing justification, making it suitable for real-world deployment in financial institutions.

Abstract: Given a set of financial transactions (who buys from whom, when, and for how much), as well as prior information from buyers and sellers, how can we find fraudulent transactions? If we have labels for some transactions for known types of fraud, we can build a classifier. However, we also want to find new types of fraud, still unknown to the domain experts (‘Detection’). Moreover, we also want to provide evidence to experts that supports our opinion (‘Justification’). In this paper, we propose FRAUDGUESS, to achieve two goals: (a) for ‘Detection’, it spots new types of fraud as micro-clusters in a carefully designed feature space; (b) for ‘Justification’, it uses visualization and heatmaps for evidence, as well as an interactive dashboard for deep dives. FRAUDGUESS is used in real life and is currently considered for deployment in an Anonymous Financial Institution (AFI). Thus, we also present the three new behaviors that FRAUDGUESS discovered in a real, million-scale financial dataset. Two of these behaviors are deemed fraudulent or suspicious by domain experts, catching hundreds of fraudulent transactions that would otherwise go un-noticed.

[1047] Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory

Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta

Main category: cs.LG

TL;DR: AI techniques (AlphaEvolve LLM agent) used to discover new combinatorial structures, improving algorithmic limits for MAX-CUT and MAX-Independent Set problems, and obtaining new inapproximability results for MAX-k-CUT.

DetailsMotivation: To explore whether AI can help discover new combinatorial structures that improve known limits on efficient algorithms, specifically for graph optimization problems.

Method: Used AlphaEvolve (LLM coding agent) to: 1) construct nearly extremal Ramanujan graphs for improved lower bounds, 2) discover new gadget reductions for inapproximability proofs, and 3) evolve faster verification procedures for candidate constructions.

Result: Improved near-optimal bounds for MAX-CUT and MAX-Independent Set on random regular graphs, and new NP-hardness results: MAX-4-CUT inapproximable within 0.987 (improving 0.9883) and MAX-3-CUT within 0.9649 (improving gadget-based 0.9853). Verification speedups up to 10,000×.

Conclusion: AI (AlphaEvolve) successfully discovered new combinatorial structures and improved algorithmic limits, while also evolving faster verification procedures. The paper discusses norms for assessing AI assistance in proof development.

Abstract: We explore whether techniques from AI can help discover new combinatorial structures that improve on known limits on efficient algorithms. Specifically, we use AlphaEvolve (an LLM coding agent) to study two settings: a) Average-case hardness for MAX-CUT and MAX-Independent Set: We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as $163$ nodes, using AlphaEvolve. Additionally, via analytical arguments we strengthen the upper bounds to settle the computational hardness of these questions up to an error in the third decimal place. b) Worst-case Hardness of Approximation for MAX-k-CUT: We obtain new inapproximability results, proving that it is NP-hard to approximate MAX-4-CUT and MAX-3-CUT within factors of $0.987$ and $0.9649$ respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of $0.9883$, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of $0.9853$, but falls short of improving the SOTA of $16/17$ that relies on a custom PCP, rather than a gadget reduction from “standard” H{\aa}stad-style PCPs. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (often requiring exponential time). In both settings above, our results were enabled by using AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by $10,000\times$). We conclude with a discussion of norms by which to assess the assistance from AI in developing proofs.

[1048] Analyzing Uncertainty Quantification in Statistical and Deep Learning Models for Probabilistic Electricity Price Forecasting

Andreas Lebedev, Abhinav Das, Sven Pappert, Stephan Schlüter

Main category: cs.LG

TL;DR: This study evaluates uncertainty quantification methods in probabilistic electricity price forecasting models for the German market, comparing deep distributional neural networks (DDNNs) with ensemble, MC dropout, and conformal prediction approaches against LASSO-estimated autoregressive (LEAR) models with quantile regression averaging and GARCH.

DetailsMotivation: Most probabilistic forecasting models fail to capture the full extent of uncertainty, which arises from data, model choices, and distributional assumptions. This study aims to address this gap by examining comprehensive uncertainty quantification in electricity price forecasting.

Method: The study compares DDNNs augmented with ensemble methods, MC dropout, and conformal prediction against LEAR models combined with quantile regression averaging, GARCH, and conformal prediction. Various uncertainty quantification techniques are applied to capture both data and model uncertainty.

Result: LEAR-based models perform well in probabilistic forecasting regardless of uncertainty quantification method. DDNNs benefit from incorporating both data and model uncertainty, improving both point and probabilistic forecasts. Conformal prediction best captures uncertainty, and all models perform competitively with relative performance depending on chosen metrics.

Conclusion: All considered models perform competitively in electricity price forecasting, with LEAR models showing strong probabilistic forecasting performance and DDNNs benefiting from comprehensive uncertainty quantification. The choice between models depends on specific forecasting metrics and requirements.

Abstract: Precise probabilistic forecasts are fundamental for energy risk management, and there is a wide range of both statistical and machine learning models for this purpose. Inherent to these probabilistic models is some form of uncertainty quantification. However, most models do not capture the full extent of uncertainty, which arises not only from the data itself but also from model and distributional choices. In this study, we examine uncertainty quantification in state-of-the-art statistical and deep learning probabilistic forecasting models for electricity price forecasting in the German market. In particular, we consider deep distributional neural networks (DDNNs) and augment them with an ensemble approach, Monte Carlo (MC) dropout, and conformal prediction to account for model uncertainty. Additionally, we consider the LASSO-estimated autoregressive (LEAR) approach combined with quantile regression averaging (QRA), generalized autoregressive conditional heteroskedasticity (GARCH), and conformal prediction. Across a range of performance metrics, we find that the LEAR-based models perform well in terms of probabilistic forecasting, irrespective of the uncertainty quantification method. Furthermore, we find that DDNNs benefit from incorporating both data and model uncertainty, improving both point and probabilistic forecasting. Uncertainty itself appears to be best captured by the models using conformal prediction. Overall, our extensive study shows that all models under consideration perform competitively. However, their relative performance depends on the choice of metrics for point and probabilistic forecasting.

[1049] When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson

Main category: cs.LG

TL;DR: LLM-judged benchmarks have design flaws that can make rankings unreliable. The paper introduces two diagnostic tools to measure schema adherence and psychometric validity, revealing high noise levels in current benchmarks like Arena-Hard Auto.

DetailsMotivation: To address reliability issues in LLM-judged benchmarks where rankings can be largely noise due to poor design and lack of verifiable constructions.

Method: Introduced two diagnostic mechanisms: schematic adherence (measuring how much verdicts follow explicit schemas) and psychometric validity (quantifying irreducible uncertainty through internal consistency and discriminant validity).

Result: Found severe schema incoherence (90%+ unexplained variance for DeepSeek-R1-32B) and factor collapse (correlations >0.93 for most criteria) in Arena-Hard Auto, showing ELO-style aggregation masks ranking uncertainty.

Conclusion: Current LLM-judged benchmarks have design failures that undermine validity, and the paper provides principles for building better-scoped, reliability-aware benchmarks.

Abstract: LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth based benchmarks. We argue that without tight objectives and verifiable constructions, benchmark rankings can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge’s overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We released our code and dataset at https://github.com/penfever/judgment-to-noise

[1050] Frame-based Equivariant Diffusion Models for 3D Molecular Generation

Mohan Guo, Cong Liu, Patrick Forré

Main category: cs.LG

TL;DR: A frame-based diffusion paradigm for molecular generation that achieves deterministic E(3)-equivariance while decoupling symmetry handling from the backbone architecture, with three variants (GFD, LFD, IFD) and EdgeDiT transformer for enhanced expressivity.

DetailsMotivation: To address the trade-off between strict equivariance with costly architectures versus relaxed equivariance for scalability and flexibility in molecular generation methods.

Method: Frame-based diffusion paradigm with three variants: Global Frame Diffusion (shared molecular frame), Local Frame Diffusion (node-specific frames with alignment constraints), and Invariant Frame Diffusion (pre-canonicalized invariant representations), plus EdgeDiT transformer with edge-aware attention.

Result: State-of-the-art performance on QM9 dataset with test NLL of -137.97 (standard scale) and -141.85 (double scale), 98.98% atom stability, 90.51% molecular stability, surpassing all equivariant baselines with high validity and uniqueness, and nearly 2x faster sampling than EDM.

Conclusion: Frame-based diffusion establishes a scalable, flexible, and physically grounded paradigm for molecular generation, highlighting the critical role of global structure preservation.

Abstract: Recent methods for molecular generation face a trade-off: they either enforce strict equivariance with costly architectures or relax it to gain scalability and flexibility. We propose a frame-based diffusion paradigm that achieves deterministic E(3)-equivariance while decoupling symmetry handling from the backbone. Building on this paradigm, we investigate three variants: Global Frame Diffusion (GFD), which assigns a shared molecular frame; Local Frame Diffusion (LFD), which constructs node-specific frames and benefits from additional alignment constraints; and Invariant Frame Diffusion (IFD), which relies on pre-canonicalized invariant representations. To enhance expressivity, we further utilize EdgeDiT, a Diffusion Transformer with edge-aware attention. On the QM9 dataset, GFD with EdgeDiT achieves state-of-the-art performance, with a test NLL of -137.97 at standard scale and -141.85 at double scale, alongside atom stability of 98.98%, and molecular stability of 90.51%. These results surpass all equivariant baselines while maintaining high validity and uniqueness and nearly 2x faster sampling compared to EDM. Altogether, our study establishes frame-based diffusion as a scalable, flexible, and physically grounded paradigm for molecular generation, highlighting the critical role of global structure preservation.

[1051] Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim

Main category: cs.LG

TL;DR: Active Attacks is a novel RL-based red-teaming algorithm that adaptively generates diverse attack prompts for LLM safety testing by periodically fine-tuning the victim model, forcing the attacker to explore new vulnerabilities.

DetailsMotivation: Existing RL methods for generating attack prompts collapse to limited modes and fail to capture a wide range of harmful behaviors, requiring explicit diversity objectives.

Method: The approach uses reinforcement learning with a toxicity classifier as reward, and periodically safety fine-tunes the victim LLM with collected attack prompts, which diminishes rewards in exploited regions and forces exploration of new vulnerabilities.

Result: Active Attacks improved cross-attack success rates from 0.07% to 31.28% (400x relative gain) compared to previous state-of-the-art GFlowNets, with only 6% increase in computation.

Conclusion: Active Attacks effectively discovers diverse attack modes through adaptive exploration and creates an easy-to-hard curriculum, outperforming existing RL methods while being computationally efficient.

Abstract: We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce \textit{Active Attacks}, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods – including GFlowNets, PPO, and REINFORCE – by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than $400\ \times$) with only a 6% increase in computation. Our code is publicly available \href{https://github.com/dbsxodud-11/active_attacks}{here}.

[1052] Boundary on the Table: Efficient Black-Box Decision-Based Attacks for Structured Data

Roie Kazoom, Yuval Ratzabi, Etamar Rothstein, Ofer Hadar

Main category: cs.LG

TL;DR: A novel black-box decision-based adversarial attack for tabular data that achieves over 90% success rates with minimal queries, exposing critical vulnerabilities in tabular models.

DetailsMotivation: Adversarial robustness in structured data is underexplored compared to vision and language domains, creating a need for specialized attacks on tabular data.

Method: Combines gradient-free direction estimation with iterative boundary search to efficiently navigate discrete and continuous feature spaces under minimal oracle access.

Result: Successfully compromises nearly entire test sets across diverse models (classical ML to LLM-based pipelines) with success rates consistently above 90% using only small number of queries per instance.

Conclusion: Highlights critical vulnerability of tabular models to adversarial perturbations and underscores urgent need for stronger defenses in real-world decision-making systems.

Abstract: Adversarial robustness in structured data remains an underexplored frontier compared to vision and language domains. In this work, we introduce a novel black-box, decision-based adversarial attack tailored for tabular data. Our approach combines gradient-free direction estimation with an iterative boundary search, enabling efficient navigation of discrete and continuous feature spaces under minimal oracle access. Extensive experiments demonstrate that our method successfully compromises nearly the entire test set across diverse models, ranging from classical machine learning classifiers to large language model (LLM)-based pipelines. Remarkably, the attack achieves success rates consistently above 90%, while requiring only a small number of queries per instance. These results highlight the critical vulnerability of tabular models to adversarial perturbations, underscoring the urgent need for stronger defenses in real-world decision-making systems.

[1053] Understanding Catastrophic Interference On the Identifibility of Latent Representations

Yuke Li, Yujia Zheng, Tianyi Xiong, Zhenyi Wang, Heng Huang

Main category: cs.LG

TL;DR: The paper proposes a novel theoretical framework that formulates catastrophic interference as an identification problem and introduces a two-stage training method to mitigate forgetting by identifying shared latent variables.

DetailsMotivation: To better understand and address catastrophic interference (catastrophic forgetting) in machine learning from a latent representation learning perspective.

Method: Proposes a two-stage training strategy: first uses maximum likelihood estimation to learn latent representations from partial-task aware (PTA) and all-task aware (ATA) setups, then optimizes KL divergence to identify shared latent variables.

Result: Theoretical analysis shows forgetting can be quantified by distance between PTA and ATA setups, and empirical validations demonstrate effective mitigation of catastrophic interference across synthetic and benchmark datasets.

Conclusion: Identifying and learning shared latent representations between task setups can effectively mitigate catastrophic interference, providing both theoretical guarantees and practical performance improvements.

Abstract: Catastrophic interference, also known as catastrophic forgetting, is a fundamental challenge in machine learning, where a trained learning model progressively loses performance on previously learned tasks when adapting to new ones. In this paper, we aim to better understand and model the catastrophic interference problem from a latent representation learning point of view, and propose a novel theoretical framework that formulates catastrophic interference as an identification problem. Our analysis demonstrates that the forgetting phenomenon can be quantified by the distance between partial-task aware (PTA) and all-task aware (ATA) setups. Building upon recent advances in identifiability theory, we prove that this distance can be minimized through identification of shared latent variables between these setups. When learning, we propose our method \ourmeos with two-stage training strategy: First, we employ maximum likelihood estimation to learn the latent representations from both PTA and ATA configurations. Subsequently, we optimize the KL divergence to identify and learn the shared latent variables. Through theoretical guarantee and empirical validations, we establish that identifying and learning these shared representations can effectively mitigate catastrophic interference in machine learning systems. Our approach provides both theoretical guarantees and practical performance improvements across both synthetic and benchmark datasets.

[1054] NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning

Raviteja Anantha, Soheil Hor, Teodor Nicola Antoniu, Layne C. Price

Main category: cs.LG

TL;DR: NanoFlux is an adversarial framework that generates targeted training data (under 200 examples) to improve LLM reasoning, outperforming conventional fine-tuning with significant performance gains across multiple domains while reducing computational requirements by 3-14x.

DetailsMotivation: To improve LLM reasoning capabilities through intelligent synthesis of small, precisely targeted training datasets rather than large-scale conventional fine-tuning, addressing computational inefficiency and performance limitations.

Method: Uses competitive dynamics between models alternating as Attacker and Defender, supervised by a tool-augmented Judge. Generates multi-step questions with explanatory annotations targeting specific reasoning capabilities, with embedding-based novelty filtering and automated evaluation.

Result: Fine-tuning a 4B-parameter model on NanoFlux data achieved: +5.9% on GSMHard (math), +3.6% on GenomeBench (science), +16.6% on MultiMedQA (medical), with 3-14x computational reduction compared to full-benchmark fine-tuning.

Conclusion: Future model improvements may lie in intelligent synthesis of small, precisely targeted training datasets rather than large-scale data collection, with domain-specific optimal points for question complexity and reasoning quality.

Abstract: We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets containing fewer than 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations that target specific reasoning capabilities. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by 3-14x. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets.

[1055] Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian

Main category: cs.LG

TL;DR: The paper analyzes how superposition mechanisms emerge during training in continuous chain-of-thought transformers for graph reachability problems, revealing a two-stage training process with bounded index-matching logits that balance exploration and exploitation.

DetailsMotivation: Previous work showed continuous CoT improves reasoning via implicit parallel thinking, but it was unclear how the superposition mechanism naturally emerges from gradient-based training methods.

Method: Theoretical analysis of training dynamics in a simplified two-layer transformer on directed graph reachability, examining two training stages: thought-generation (autoregressive expansion) and prediction (converting thought to answer).

Result: Analysis reveals that index-matching logit first increases then remains bounded, effectively balancing exploration and exploitation - exploiting local structures while assigning comparable weights to multiple traces when uncertain, creating superposition. Experimental results validate the theory.

Conclusion: The bounded index-matching logit during continuous CoT training naturally enables superposition by balancing local search exploitation with exploration of multiple reasoning traces, explaining how parallel thinking emerges from gradient-based training.

Abstract: Previous work shows that the chain of continuous thought (continuous CoT) improves the reasoning capability of large language models (LLMs) by enabling implicit parallel thinking, and a subsequent work provided theoretical insight by showing that a two-layer transformer equipped with continuous CoT can efficiently solve directed graph reachability by maintaining a superposition of multiple reasoning traces in the continuous thought. However, it remains unclear how the superposition mechanism is naturally learned from gradient-based training methods. To fill this gap, we theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem to unveil how the superposition mechanism emerges during training in two training stages – (i) a thought-generation stage that autoregressively expands the continuous thought, and (ii) a prediction stage that converts the thought into the final answer. Our analysis reveals that during training using continuous thought, the index-matching logit, an important quantity which reflects the strength of the model’s local search ability, will first increase and then remain bounded under mild assumptions. The bounded index-matching logit effectively balances exploration and exploitation during the reasoning process: the model will exploit local problem structures to identify plausible search traces, and assign comparable weights to multiple such traces to explore when it is uncertain about which solution is correct, which results in superposition. Our experimental results tracking the growth of logits further validate our theory.

[1056] Uncertainty-Aware Generative Oversampling Using an Entropy-Guided Conditional Variational Autoencoder

Amirhossein Zare, Amirhessam Zare, Parmida Sadat Pezeshki, Herlock, Rahimi, Ali Ebrahimi, Ignacio Vázquez-García, Leo Anthony Celi

Main category: cs.LG

TL;DR: LEO-CVAE is a generative oversampling method that uses local entropy to guide data generation in uncertain boundary regions, improving performance on imbalanced biomedical datasets.

DetailsMotivation: Traditional oversampling methods produce implausible samples, while standard generative models neglect uncertain boundary examples that are crucial for effective learning in imbalanced scenarios.

Method: Proposes LEO-CVAE with two key components: Local Entropy-Weighted Loss to emphasize uncertain regions during training, and entropy-guided sampling to generate synthetic samples in class-overlapping boundary areas.

Result: Outperforms both traditional oversampling methods and generative baselines on clinical genomics datasets (ADNI and TCGA lung cancer), consistently improving classifier performance.

Conclusion: Uncertainty-aware generative oversampling is valuable for imbalanced learning in domains with complex nonlinear structures like omics data.

Abstract: Class imbalance remains a major challenge in machine learning, especially for high-dimensional biomedical data where nonlinear manifold structures dominate. Traditional oversampling methods such as SMOTE rely on local linear interpolation, often producing implausible synthetic samples. Deep generative models like Conditional Variational Autoencoders (CVAEs) better capture nonlinear distributions, but standard variants treat all minority samples equally, neglecting the importance of uncertain, boundary-region examples emphasized by heuristic methods like Borderline-SMOTE and ADASYN. We propose Local Entropy-Guided Oversampling with a CVAE (LEO-CVAE), a generative oversampling framework that explicitly incorporates local uncertainty into both representation learning and data generation. To quantify uncertainty, we compute Shannon entropy over the class distribution in a sample’s neighborhood: high entropy indicates greater class overlap, serving as a proxy for uncertainty. LEO-CVAE leverages this signal through two mechanisms: (i) a Local Entropy-Weighted Loss (LEWL) that emphasizes robust learning in uncertain regions, and (ii) an entropy-guided sampling strategy that concentrates generation in these informative, class-overlapping areas. Applied to clinical genomics datasets (ADNI and TCGA lung cancer), LEO-CVAE consistently improves classifier performance, outperforming both traditional oversampling and generative baselines. These results highlight the value of uncertainty-aware generative oversampling for imbalanced learning in domains governed by complex nonlinear structures, such as omics data.

[1057] Autonomy-Aware Clustering: When Local Decisions Supersede Global Prescriptions

Amber Srivastava, Salar Basiri, Srinivasa Salapaka

Main category: cs.LG

TL;DR: This paper introduces autonomy-aware clustering, a reinforcement learning framework that accounts for entities’ local autonomy in clustering without requiring prior knowledge of autonomy forms, achieving much closer results to ground truth compared to traditional methods.

DetailsMotivation: Traditional clustering assumes passive entities that strictly conform to assigned groups, but in reality entities exhibit local autonomy that can reshape clustering outcomes, affecting cluster compositions, geometry, and cardinality with significant downstream effects.

Method: Integrates reinforcement learning with Deterministic Annealing (DA) procedure that promotes exploration early and exploitation later, plus an Adaptive Distance Estimation Network (ADEN) - a transformer-based attention model that learns dependencies between entities and cluster representatives within the RL loop.

Result: Empirical results show the framework closely aligns with underlying data dynamics: achieves solutions close to ground truth (gap ~3-4%) without explicit autonomy models, whereas ignoring autonomy leads to substantially larger gaps (~35-40%).

Conclusion: The proposed autonomy-aware clustering framework effectively accounts for entity autonomy in clustering problems, demonstrating significant improvements over traditional approaches that ignore such autonomy.

Abstract: Clustering arises in a wide range of problem formulations, yet most existing approaches assume that the entities under clustering are passive and strictly conform to their assigned groups. In reality, entities often exhibit local autonomy, overriding prescribed associations in ways not fully captured by feature representations. Such autonomy can substantially reshape clustering outcomes – altering cluster compositions, geometry, and cardinality – with significant downstream effects on inference and decision-making. We introduce autonomy-aware clustering, a reinforcement learning (RL) framework that learns and accounts for the influence of local autonomy without requiring prior knowledge of its form. Our approach integrates RL with a Deterministic Annealing (DA) procedure, where, to determine underlying clusters, DA naturally promotes exploration in early stages of annealing and transitions to exploitation later. We also show that the annealing procedure exhibits phase transitions that enable design of efficient annealing schedules. To further enhance adaptability, we propose the Adaptive Distance Estimation Network (ADEN), a transformer-based attention model that learns dependencies between entities and cluster representatives within the RL loop, accommodates variable-sized inputs and outputs, and enables knowledge transfer across diverse problem instances. Empirical results show that our framework closely aligns with underlying data dynamics: even without explicit autonomy models, it achieves solutions close to the ground truth (gap ~3-4%), whereas ignoring autonomy leads to substantially larger gaps (~35-40%). The code and data are publicly available at https://github.com/salar96/AutonomyAwareClustering.

[1058] Scalable Disk-Based Approximate Nearest Neighbor Search with Page-Aligned Graph

Dingyi Kang, Dongming Jiang, Hanshen Yang, Hang Liu, Bingzhe Li

Main category: cs.LG

TL;DR: PageANN is a disk-based ANNS framework that introduces a page-node graph structure aligned with SSD pages, reducing I/O operations and improving scalability for large-scale vector search.

DetailsMotivation: Existing disk-based ANNS methods suffer from long I/O traversal paths, misalignment with storage I/O granularity, and high in-memory indexing overhead, limiting scalability for large-scale vector search.

Method: PageANN clusters similar vectors into page nodes aligned with physical SSD pages, uses a co-designed disk data layout with merging technique to store only representative vectors and topology information, and implements memory management with lightweight indexing and coordinated memory-disk data allocation.

Result: PageANN achieves 1.85x-10.83x higher throughput and 51.7%-91.9% lower latency compared to state-of-the-art disk-based ANNS methods across different datasets and memory budgets, while maintaining comparable high recall accuracy.

Conclusion: PageANN provides a high-performance and scalable solution for disk-based approximate nearest neighbor search by optimizing I/O operations through page-node graph alignment and efficient memory management.

Abstract: Approximate Nearest Neighbor Search (ANNS), as the core of vector databases (VectorDBs), has become widely used in modern AI and ML systems, powering applications from information retrieval to bio-informatics. While graph-based ANNS methods achieve high query efficiency, their scalability is constrained by the available host memory. Recent disk-based ANNS approaches mitigate memory usage by offloading data to Solid-State Drives (SSDs). However, they still suffer from issues such as long I/O traversal path, misalignment with storage I/O granularity, and high in-memory indexing overhead, leading to significant I/O latency and ultimately limiting scalability for large-scale vector search. In this paper, we propose PageANN, a disk-based approximate nearest neighbor search (ANNS) framework designed for high performance and scalability. PageANN introduces a page-node graph structure that aligns logical graph nodes with physical SSD pages, thereby shortening I/O traversal paths and reducing I/O operations. Specifically, similar vectors are clustered into page nodes, and a co-designed disk data layout leverages this structure with a merging technique to store only representative vectors and topology information, avoiding unnecessary reads. To further improve efficiency, we design a memory management strategy that combines lightweight indexing with coordinated memory-disk data allocation, maximizing host memory utilization while minimizing query latency and storage overhead. Experimental results show that PageANN significantly outperforms state-of-the-art (SOTA) disk-based ANNS methods, achieving 1.85x-10.83x higher throughput and 51.7%-91.9% lower latency across different datasets and memory budgets, while maintaining comparable high recall accuracy.

[1059] Muon Outperforms Adam in Tail-End Associative Memory Learning

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, Vincent Y. F. Tan

Main category: cs.LG

TL;DR: Muon optimizer outperforms Adam in LLM training due to its effectiveness with associative memory parameters (VO attention weights and FFNs) and superior handling of heavy-tailed data distributions, particularly for tail classes.

DetailsMotivation: To understand why Muon optimizer consistently trains LLMs faster than Adam, despite both being widely used optimization methods.

Method: Ablated transformer components optimized by Muon, analyzed singular spectrum properties, and theoretically analyzed a one-layer associative memory model under class-imbalanced data.

Result: Muon’s update rule produces more isotropic singular spectrum than Adam, enabling more effective optimization of tail classes in heavy-tailed data distributions.

Conclusion: Muon’s core advantage lies in its update rule aligning with the outer-product structure of linear associative memories, enabling balanced learning across classes regardless of feature embeddings.

Abstract: The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon’s superiority. Motivated by this associative memory view, we then explain Muon’s superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon’s core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.

[1060] Layer-wise dynamic rank for compressing large language models

Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang

Main category: cs.LG

TL;DR: D-Rank is a dynamic rank allocation framework for LLM compression that addresses layer heterogeneity by using effective rank as an information density metric and optimizing rank distribution via Lagrange multipliers, outperforming uniform compression methods.

DetailsMotivation: Existing SVD-based LLM compression methods use uniform compression ratios across layers, ignoring the heterogeneity where middle layers encode richer information while early and late layers are more redundant.

Method: Proposes D-Rank framework with: 1) effective rank as information density metric, 2) Lagrange multiplier-based optimization for adaptive rank allocation, 3) rank rebalancing for attention layers, 4) extension to grouped-query attention LLMs.

Result: Consistently outperforms SVD-LLM, ASVD, and Basis Sharing with >15 lower perplexity on LLaMA-3-8B at 20% compression and up to 5% higher zero-shot reasoning accuracy on LLaMA-7B at 40% compression while achieving higher throughput.

Conclusion: D-Rank demonstrates that adaptive layer-wise rank allocation based on information density significantly improves LLM compression performance over uniform approaches, effectively addressing intra-layer heterogeneity.

Abstract: Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges that hinder their deployment. Singular Value Decomposition (SVD)-based compression has emerged as an appealing post-training compression technique for LLMs, yet most existing methods apply a uniform compression ratio across all layers, implicitly assuming homogeneous information included in various layers. This overlooks the substantial intra-layer heterogeneity observed in LLMs, where middle layers tend to encode richer information while early and late layers are more redundant. In this work, we revisit the existing SVD-based compression method and propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLMs compression. We first introduce effective rank as a principled metric to measure the information density of weight matrices, and then allocate ranks via a Lagrange multiplier-based optimization scheme to adaptively assign more capacity to groups with higher information density under a fixed compression ratio. Moreover, we rebalance the allocated ranks across attention layers to account for their varying importance and extend D-Rank to latest LLMs with grouped-query attention. Extensive experiments on various LLMs with different scales across multiple compression ratios demonstrate that D-Rank consistently outperforms SVD-LLM, ASVD, and Basis Sharing, achieving more than 15 lower perplexity with LLaMA-3-8B model on C4 datasets at 20% compression ratio and up to 5% higher zero-shot reasoning accuracy with LLaMA-7B model at 40% compression ratio while achieving even higher throughput.

[1061] MG2FlowNet: Accelerating High-Reward Sample Generation via Enhanced MCTS and Greediness Control

Rui Zhu, Xuan Yu, Yudong Zhang, Chen Zhang, Xu Wang, Yang Wang

Main category: cs.LG

TL;DR: Integrating enhanced Monte Carlo Tree Search (MCTS) into GFlowNets to improve high-reward sample generation while maintaining diversity.

DetailsMotivation: Existing GFlowNets struggle to consistently generate high-reward samples in large search spaces with sparse high-reward regions, needing better balance between diversity and reward optimization.

Method: Enhanced MCTS integration using MCTS-based policy evaluation and Polynomial Upper Confidence Trees (PUCT) with controllable greediness mechanism to balance exploration and exploitation.

Result: Accelerates discovery of high-reward regions and continuously generates high-reward samples while preserving generative distribution diversity.

Conclusion: The method successfully enhances exploitation without sacrificing diversity in GFlowNets through dynamic exploration-reward balance.

Abstract: Generative Flow Networks (GFlowNets) have emerged as a powerful tool for generating diverse and high-reward structured objects by learning to sample from a distribution proportional to a given reward function. Unlike conventional reinforcement learning (RL) approaches that prioritize optimization of a single trajectory, GFlowNets seek to balance diversity and reward by modeling the entire trajectory distribution. This capability makes them especially suitable for domains such as molecular design and combinatorial optimization. However, existing GFlowNets sampling strategies tend to overexplore and struggle to consistently generate high-reward samples, particularly in large search spaces with sparse high-reward regions. Therefore, improving the probability of generating high-reward samples without sacrificing diversity remains a key challenge under this premise. In this work, we integrate an enhanced Monte Carlo Tree Search (MCTS) into the GFlowNets sampling process, using MCTS-based policy evaluation to guide the generation toward high-reward trajectories and Polynomial Upper Confidence Trees (PUCT) to balance exploration and exploitation adaptively, and we introduce a controllable mechanism to regulate the degree of greediness. Our method enhances exploitation without sacrificing diversity by dynamically balancing exploration and reward-driven guidance. The experimental results show that our method can not only accelerate the speed of discovering high-reward regions but also continuously generate high-reward samples, while preserving the diversity of the generative distribution. All implementations are available at https://github.com/ZRNB/MG2FlowNet.

[1062] Learning Passive Continuous-Time Dynamics with Multistep Port-Hamiltonian Gaussian Processes

Chi Ho Leung, Philip E. Paré

Main category: cs.LG

TL;DR: The paper proposes MS-PHS GP, a method that learns physically consistent continuous-time dynamics and Hamiltonian posterior from noisy trajectory data using Gaussian processes and multistep integrators.

DetailsMotivation: To learn physically consistent continuous-time dynamics from noisy, irregularly-sampled trajectories while enforcing energy balance and passivity constraints.

Method: Places GP prior on Hamiltonian surface and encodes variable-step multistep integrator constraints as finite linear functionals, enabling closed-form conditioning of vector field and Hamiltonian without latent states.

Result: Achieves improved vector-field recovery and well-calibrated Hamiltonian uncertainty on mass-spring, Van der Pol, and Duffing benchmarks, with finite-sample vector-field bound separating estimation and discretization terms.

Conclusion: MS-PHS GP provides an effective framework for learning physically consistent dynamics with Hamiltonian uncertainty quantification while maintaining energy balance and passivity by design.

Abstract: We propose the multistep port-Hamiltonian Gaussian process (MS-PHS GP) to learn physically consistent continuous-time dynamics and a posterior over the Hamiltonian from noisy, irregularly-sampled trajectories. By placing a GP prior on the Hamiltonian surface $H$ and encoding variable-step multistep integrator constraints as finite linear functionals, MS-PHS GP enables closed-form conditioning of both the vector field and the Hamiltonian surface without latent states, while enforcing energy balance and passivity by design. We state a finite-sample vector-field bound that separates the estimation and variable-step discretization terms. Lastly, we demonstrate improved vector-field recovery and well-calibrated Hamiltonian uncertainty on mass-spring, Van der Pol, and Duffing benchmarks.

[1063] Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization

Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu

Main category: cs.LG

TL;DR: The paper establishes a unified framework showing that ‘$k_1$ in reward’ (like PPO) and ‘$k_2$ as loss’ are gradient-equivalent and theoretically sound implementations of Reverse KL regularization, while ‘$k_3$ as loss’ (like GRPO) is a biased approximation.

DetailsMotivation: To address the inconsistent implementation of KL divergence loss in RLHF methods, where some approaches treat it as a detached coefficient while others use it as a direct loss function, leading to potential theoretical issues.

Method: Developed a unified framework connecting different KL regularization implementation styles, proving gradient equivalence between ‘$k_1$ in reward’ and ‘$k_2$ as loss’ under on-policy conditions, and analyzing bias in off-policy implementations.

Result: Proved that ‘$k_1$ in reward’ and ‘$k_2$ as loss’ are gradient-equivalent and principled implementations of RKL regularization, while ‘$k_3$ as loss’ is a biased first-order approximation. Also identified bias in off-policy implementations and proposed a correction.

Conclusion: The work provides a comprehensive gradient-based rationale for choosing correct KL regularization implementations, enabling more robust RLHF systems by unifying different implementation perspectives and identifying principled approaches.

Abstract: Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation-a practice that overlooks the term’s functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $k_n$ as a detached coefficient for the policy’s score function (’$k_n$ in reward’) or as a direct loss function through which gradients are propagated (’$k_n$ as loss’). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional ‘$k_1$ in reward’ (like in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the ‘$k_2$ as loss’ formulation is, in fact, gradient-equivalent to ‘$k_1$ in reward’. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted ‘$k_3$ as loss’ (like in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of ‘$k_n$ as loss’ methods are biased due to neglected importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.

[1064] PEL-NAS: Search Space Partitioned Architecture Prompt Co-Evolutionary LLM-driven Hardware-Aware Neural Architecture Search

Hengyi Zhu, Grace Li Zhang, Shaoyi Huang

Main category: cs.LG

TL;DR: PEL-NAS is a novel LLM-driven Neural Architecture Search method that addresses exploration bias in traditional approaches by partitioning the search space, using co-evolutionary prompts, and zero-cost predictors to efficiently find high-accuracy, low-latency neural networks.

DetailsMotivation: Traditional HW-NAS methods require multiple GPU days per dataset, while LLM-driven approaches suffer from exploration bias - repeatedly proposing designs within limited search space and failing to discover architectures across different latency ranges.

Method: Three key components: 1) Complexity-driven partitioning engine to divide search space by complexity for diversity; 2) LLM-powered architecture prompt co-evolution operator that updates design heuristics knowledge base and performs guided evolution; 3) Zero-cost predictor to avoid training candidates from scratch.

Result: On HW-NAS-Bench, PEL-NAS achieves higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Search cost drops from days to minutes compared to traditional supernet baselines.

Conclusion: PEL-NAS effectively addresses exploration bias in LLM-driven NAS, enabling efficient discovery of diverse, high-performance neural architectures across the entire latency spectrum with significantly reduced search time.

Abstract: Hardware-Aware Neural Architecture Search (HW-NAS) requires joint optimization of accuracy and latency under device constraints. Traditional supernet-based methods require multiple GPU days per dataset. Large Language Model (LLM)-driven approaches avoid training a large supernet and can provide quick feedback, but we observe an exploration bias: the LLM repeatedly proposes neural network designs within limited search space and fails to discover architectures across different latency ranges in the entire search space. To address this issue, we propose PEL-NAS: a search space Partitioned, architecture prompt co-Evolutionary and LLM-driven Neural Architecture Search that can generate neural networks with high accuracy and low latency with reduced search cost. Our proposed PEL-NAS has three key components: 1) a complexity-driven partitioning engine that divides the search space by complexity to enforce diversity and mitigate exploration bias; 2) an LLM-powered architecture prompt co-evolution operator, in which the LLM first updates a knowledge base of design heuristics based on results from the previous round, then performs a guided evolution algorithm on architectures with prompts that incorporate this knowledge base. Prompts and designs improve together across rounds which avoids random guesswork and improve efficiency; 3) a zero-cost predictor to avoid training a large number of candidates from scratch. Experimental results show that on HW-NAS-Bench, PEL-NAS can achieve overall higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Meanwhile, the search cost drops from days to minutes compared with traditional supernet baselines.

[1065] Learning Representations Through Contrastive Neural Model Checking

Vladimir Krsmanovic, Matthias Cosler, Mohamed Ghanem, Bernd Finkbeiner

Main category: cs.LG

TL;DR: CNML introduces contrastive learning for model checking, embedding specifications and systems into a shared latent space to improve verification performance and enable transfer learning.

DetailsMotivation: Representation learning is underexplored in formal verification despite its success in vision and language domains. The paper aims to leverage model checking as a guiding signal for learning aligned representations.

Method: Contrastive Neural Model Checking (CNML) uses self-supervised contrastive learning to jointly embed logical specifications and systems into a shared latent space.

Result: CNML significantly outperforms algorithmic and neural baselines on industry-inspired retrieval tasks in both cross-modal and intra-modal settings. Learned representations transfer well to downstream tasks and generalize to complex formulas.

Conclusion: Model checking can effectively serve as an objective for learning representations for formal languages, demonstrating the value of representation learning in formal verification.

Abstract: Model checking is a key technique for verifying safety-critical systems against formal specifications, where recent applications of deep learning have shown promise. However, while ubiquitous for vision and language domains, representation learning remains underexplored in formal verification. We introduce Contrastive Neural Model Checking (CNML), a novel method that leverages the model checking task as a guiding signal for learning aligned representations. CNML jointly embeds logical specifications and systems into a shared latent space through a self-supervised contrastive objective. On industry-inspired retrieval tasks, CNML considerably outperforms both algorithmic and neural baselines in cross-modal and intra-modal settings. We further show that the learned representations effectively transfer to downstream tasks and generalize to more complex formulas. These findings demonstrate that model checking can serve as an objective for learning representations for formal languages.

[1066] PENEX: AdaBoost-Inspired Neural Network Regularization

Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach

Main category: cs.LG

TL;DR: PENEX is a new multi-class exponential loss formulation that enables first-order optimization, implicitly maximizes margins, and shows better regularization than established methods in computer vision and language tasks.

DetailsMotivation: AdaBoost uses exponential loss for sequential weak learner fitting but paradoxically generalizes well despite the aggressive penalty. Existing exponential loss formulations are not amenable to first-order optimization methods.

Method: Introduces Penalized Exponential Loss (PENEX), a theoretically grounded multi-class exponential loss formulation that can be optimized via first-order methods. It implicitly maximizes margins and parameterizes weak learners through gradient increments.

Result: PENEX exhibits superior regularization effects compared to established methods with similar computational cost across computer vision and language tasks. It demonstrates implicit margin maximization and effective training capabilities.

Conclusion: PENEX serves as an AdaBoost-inspired alternative for effective training and fine-tuning of deep neural networks, combining theoretical grounding with practical optimization benefits.

Abstract: AdaBoost sequentially fits so-called weak learners to minimize an exponential loss, which penalizes mislabeled data points more severely than other loss functions like cross-entropy. Paradoxically, AdaBoost generalizes well in practice as the number of weak learners grows. In the present work, we introduce Penalized Exponential Loss (PENEX), a new formulation of the multi-class exponential loss that is theoretically grounded and, in contrast to the existing formulation, amenable to optimization via first-order methods. We demonstrate both empirically and theoretically that PENEX implicitly maximizes margins of data points. Also, we show that gradient increments on PENEX implicitly parameterize weak learners in the boosting framework. Across computer vision and language tasks, we show that PENEX exhibits a regularizing effect often better than established methods with similar computational cost. Our results highlight PENEX’s potential as an AdaBoost-inspired alternative for effective training and fine-tuning of deep neural networks.

[1067] MINERVA: Mutual Information Neural Estimation for Supervised Feature Selection

Taurai Muvunza, Egor Kraev, Pere Planell-Morell, Alexander Y. Shestopaloff

Main category: cs.LG

TL;DR: MINERVA is a neural network-based feature selection method that uses mutual information estimation to capture complex feature-target relationships, including higher-order interactions that traditional pair-wise dependence metrics miss.

DetailsMotivation: Traditional feature filters fail when targets depend on higher-order feature interactions rather than individual contributions, limiting their ability to capture complex dependency structures.

Method: Two-stage neural network approach: parameterize mutual information approximation with neural networks, use sparsity-inducing regularizers in loss function, and decouple representation learning from feature selection for better generalization.

Result: Effectively captures complex feature-target relationships and higher-order dependencies that are rarely captured in literature, with experimental validation on synthetic and real-life fraud datasets.

Conclusion: MINERVA provides an effective solution for supervised feature selection that handles complex dependency structures through neural estimation of mutual information and ensemble evaluation of feature subsets.

Abstract: Existing feature filters rely on statistical pair-wise dependence metrics to model feature-target relationships, but this approach may fail when the target depends on higher-order feature interactions rather than individual contributions. We introduce Mutual Information Neural Estimation Regularized Vetting Algorithm (MINERVA), a novel approach to supervised feature selection based on neural estimation of mutual information between features and targets. We paramaterize the approximation of mutual information with neural networks and perform feature selection using a carefully designed loss function augmented with sparsity-inducing regularizers. Our method is implemented in a two-stage process to decouple representation learning from feature selection, ensuring better generalization and a more accurate expression of feature importance. We present examples of ubiquitous dependency structures that are rarely captured in literature and show that our proposed method effectively captures these complex feature-target relationships by evaluating feature subsets as an ensemble. Experimental results on synthetic and real-life fraud datasets demonstrate the efficacy of our method and its ability to perform exact solutions.

[1068] Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps

Kyoungjun Park, Yifan Yang, Changhan Ge, Lili Qiu, Shiqi Jiang

Main category: cs.LG

TL;DR: Diffusion^2 is a diffusion-based method that uses 3D point clouds to model RF signal propagation across multiple frequencies, achieving high accuracy with 1.9 dB error and 27x speedup over existing methods.

DetailsMotivation: RF signal propagation modeling is crucial for environmental understanding and wireless applications, but existing methods struggle with complex environments due to signal interactions with obstacles. RGB cameras have limitations in spectrum, coverage, and occlusion handling.

Method: Uses diffusion-based approach with 3D point clouds and introduces RF-3D Encoder to capture RF-related features from 3D geometry. Features undergo multi-scale embedding to simulate RF signal dissemination process.

Result: Achieves accurate RF signal behavior estimation across various frequency bands and environmental conditions with only 1.9 dB error margin and 27x faster performance compared to existing methods.

Conclusion: Diffusion^2 represents a significant advancement in RF signal propagation modeling, providing accurate and efficient prediction capabilities for complex environments across multiple frequency ranges.

Abstract: Modeling radio frequency (RF) signal propagation is essential for understanding the environment, as RF signals offer valuable insights beyond the capabilities of RGB cameras, which are limited by the visible-light spectrum, lens coverage, and occlusions. It is also useful for supporting wireless diagnosis, deployment, and optimization. However, accurately predicting RF signals in complex environments remains a challenge due to interactions with obstacles such as absorption and reflection. We introduce Diffusion^2, a diffusion-based approach that uses 3D point clouds to model the propagation of RF signals across a wide range of frequencies, from Wi-Fi to millimeter waves. To effectively capture RF-related features from 3D data, we present the RF-3D Encoder, which encapsulates the complexities of 3D geometry along with signal-specific details. These features undergo multi-scale embedding to simulate the actual RF signal dissemination process. Our evaluation, based on synthetic and real-world measurements, demonstrates that Diffusion^2 accurately estimates the behavior of RF signals in various frequency bands and environmental conditions, with an error margin of just 1.9 dB and 27x faster than existing methods, marking a significant advancement in the field. Refer to https://rfvision-project.github.io/ for more information.

[1069] Relevance-Aware Thresholding in Online Conformal Prediction for Time Series

Théo Dupuy, Binbin Xu, Stéphane Perrey, Jacky Montmain, Abdelhak Imoussaten

Main category: cs.LG

TL;DR: The paper proposes enhancing Online Conformal Prediction (OCP) by replacing binary evaluation of prediction intervals with relevance-quantifying functions to prevent abrupt threshold changes and produce narrower intervals while maintaining coverage validity.

DetailsMotivation: Most existing OCP methods focus only on whether ground truth falls inside/outside prediction intervals during threshold updates, without considering interval relevance. This paper aims to leverage this overlooked aspect to improve prediction interval quality.

Method: Enhanced threshold update step that replaces binary evaluation (inside/outside) with a broader class of functions that quantify prediction interval relevance using ground truth, preventing abrupt threshold changes.

Result: Experimental results on real-world datasets show that the proposed functions can produce tighter prediction intervals compared to existing OCP methods while maintaining coverage validity.

Conclusion: Quantifying prediction interval relevance through enhanced threshold update functions leads to narrower intervals without compromising coverage validity, improving the overall quality of uncertainty quantification in time series forecasting.

Abstract: Uncertainty quantification has received considerable interest in recent works in Machine Learning. In particular, Conformal Prediction (CP) gains ground in this field. For the case of time series, Online Conformal Prediction (OCP) becomes an option to address the problem of data distribution shift over time. Indeed, the idea of OCP is to update a threshold of some quantity (whether the miscoverage level or the quantile) based on the distribution observation. To evaluate the performance of OCP methods, two key aspects are typically considered: the coverage validity and the prediction interval width minimization. Recently, new OCP methods have emerged, offering long-run coverage guarantees and producing more informative intervals. However, during the threshold update step, most of these methods focus solely on the validity of the prediction intervals~–that is, whether the ground truth falls inside or outside the interval–~without accounting for their relevance. In this paper, we aim to leverage this overlooked aspect. Specifically, we propose enhancing the threshold update step by replacing the binary evaluation (inside/outside) with a broader class of functions that quantify the relevance of the prediction interval using the ground truth. This approach helps prevent abrupt threshold changes, potentially resulting in narrower prediction intervals. Indeed, experimental results on real-world datasets suggest that these functions can produce tighter intervals compared to existing OCP methods while maintaining coverage validity.

[1070] Distributional Inverse Reinforcement Learning

Feiyang Wu, Ye Zhao, Anqi Wu

Main category: cs.LG

TL;DR: A distributional framework for offline IRL that models uncertainty over rewards and return distributions, capturing richer expert behavior structure through FSD minimization and DRMs.

DetailsMotivation: Conventional IRL methods only recover deterministic rewards or match expected returns, lacking the ability to capture richer structure in expert behavior and reward distributions.

Method: Jointly models uncertainty over reward functions and full return distributions by minimizing first-order stochastic dominance violations and integrating distortion risk measures into policy learning.

Result: Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks show expressive reward representations and state-of-the-art imitation performance.

Conclusion: The distributional framework enables recovery of both reward distributions and distribution-aware policies, making it well-suited for behavior analysis and risk-aware imitation learning.

Abstract: We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing first-order stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art imitation performance.

cs.MA

Sanket Badhe

Main category: cs.MA

TL;DR: LegalSim is a multi-agent simulation that shows how AI can exploit procedural weaknesses in legal systems through emergent exploit chains while remaining procedurally valid.

DetailsMotivation: To explore how AI systems can exploit procedural weaknesses in codified legal rules and motivate red-teaming of legal rule systems beyond model-level testing.

Method: Modular multi-agent simulation with plaintiff/defendant agents using constrained action space governed by JSON rules engine, with stochastic judge model. Compared four policies: PPO, contextual bandit with LLM, direct LLM policy, and hand-crafted heuristic.

Result: PPO wins most often, bandit is most consistently competitive, LLM trails them, heuristic is weakest. Emergent exploit chains observed including cost-inflating discovery sequences and calendar-pressure tactics that remain procedurally valid but systemically harmful.

Conclusion: Simulation reveals emergent exploit chains in legal proceedings, motivating the need for red-teaming of legal rule systems in addition to model-level testing.

Abstract: We present LegalSim, a modular multi-agent simulation of adversarial legal proceedings that explores how AI systems can exploit procedural weaknesses in codified rules. Plaintiff and defendant agents choose from a constrained action space (for example, discovery requests, motions, meet-and-confer, sanctions) governed by a JSON rules engine, while a stochastic judge model with calibrated grant rates, cost allocations, and sanction tendencies resolves outcomes. We compare four policies: PPO, a contextual bandit with an LLM, a direct LLM policy, and a hand-crafted heuristic; Instead of optimizing binary case outcomes, agents are trained and evaluated using effective win rate and a composite exploit score that combines opponent-cost inflation, calendar pressure, settlement pressure at low merit, and a rule-compliance margin. Across configurable regimes (e.g., bankruptcy stays, inter partes review, tax procedures) and heterogeneous judges, we observe emergent ``exploit chains’’, such as cost-inflating discovery sequences and calendar-pressure tactics that remain procedurally valid yet systemically harmful. Evaluation via cross-play and Bradley-Terry ratings shows, PPO wins more often, the bandit is the most consistently competitive across opponents, the LLM trails them, and the heuristic is weakest. The results are stable in judge settings, and the simulation reveals emergent exploit chains, motivating red-teaming of legal rule systems in addition to model-level testing.

[1072] Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning

Nicolò Dal Fabbro, Milad Mesbahi, Renato Mendes, João Borges de Sousa, George J. Pappas

Main category: cs.MA

TL;DR: Multi-agent reinforcement learning approach for long-term river plume mapping using AUVs, integrating spatiotemporal Gaussian process regression with multi-head Q-network controllers.

DetailsMotivation: To enable energy- and communication-efficient long-term monitoring of dynamic river plumes using multiple autonomous underwater vehicles.

Method: Combines spatiotemporal Gaussian process regression with multi-head Q-network controllers that regulate AUV direction and speed, using intermittent central coordination.

Result: Outperforms single- and multi-agent benchmarks in simulations, with scaling agents improving MSE and endurance. Doubling AUVs can more than double endurance while maintaining accuracy.

Conclusion: Learned policies generalize across seasonal regimes, showing promise for data-driven long-term monitoring of dynamic plume environments.

Abstract: We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy - and communication - efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents both improving mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.

[1073] Cooperative Flexibility Exchange: Fair and Comfort-Aware Decentralized Resource Allocation

Rabiya Khalid, Evangelos Pournaras

Main category: cs.MA

TL;DR: A decentralized multi-agent coordination system for demand-side energy management that uses slot exchange mechanism to improve user comfort while maintaining system efficiency.

DetailsMotivation: Existing energy management systems prioritize system efficiency over user comfort, creating a gap that needs to be addressed for better smart grid solutions.

Method: Proposes a decentralized multi-agent coordination system with slot exchange mechanism, where agents first receive optimized appliance schedules and then coordinate to adjust schedules through slot exchanges.

Result: The slot exchange mechanism increases user comfort and fairness without raising system inefficiency cost, and scales well with large populations even with non-altruistic agent behavior.

Conclusion: The proposed system provides a practical and scalable solution for future smart grids that balances both system efficiency and user comfort.

Abstract: The growing electricity demand and increased use of smart appliances are placing new pressures on power grids, making efficient energy management more important than ever. The existing energy management systems often prioritize system efficiency (balanced energy demand and supply) at the expense of user comfort. This paper addresses this gap by proposing a novel decentralized multi-agent coordination-based demand-side management system. The proposed system enables individual agents to coordinate for demand-side energy optimization while improving the user comfort and maintaining the system efficiency. A key innovation of this work is the introduction of a slot exchange mechanism, where agents first receive optimized appliance-level energy consumption schedules and then coordinate with each other to adjust these schedules through slot exchanges. This approach improves user comfort even when agents show non-altruistic behaviour, and it scales well with large populations. The system also promotes fairness by balancing satisfaction levels across users. For performance evaluation, a real-world dataset is used, and the results demonstrate that the proposed slot exchange mechanism increases user comfort and fairness without raising system inefficiency cost, making it a practical and scalable solution for future smart grids.

[1074] Small Fleet, Big Impact: Enhancing Shared Micromobility Efficiency through Minimal Autonomous Vehicle Deployment

Heng Tan, Hua Yan, Lucas Yang, Yu Yang

Main category: cs.MA

TL;DR: SMART is a hierarchical reinforcement learning framework that integrates autonomous shared micromobility vehicles (ASMVs) with self-rebalancing capabilities to dynamically adapt to real-time demand fluctuations in shared micromobility systems.

DetailsMotivation: Existing shared micromobility vehicle scheduling methods redistribute vehicles only once or twice per day, making them vulnerable to performance degradation under atypical conditions due to spatio-temporal demand fluctuations.

Method: A hierarchical reinforcement learning framework that jointly optimizes high-level initial deployment and low-level real-time rebalancing for autonomous shared micromobility vehicles (ASMVs).

Result: Evaluation based on real-world e-scooter usage data from Chicago shows the framework is highly effective with strong generalization capability, significantly enhancing overall micromobility service performance.

Conclusion: The SMART framework allows seamless integration with existing vehicle scheduling methods and significantly improves service performance by dynamically adapting to real-time demand through autonomous rebalancing vehicles.

Abstract: Shared micromobility systems, such as electric scooters and bikes, have gained widespread popularity as sustainable alternatives to traditional transportation modes. However, these systems face persistent challenges due to spatio-temporal demand fluctuations, often resulting in a mismatch between vehicle supply and user demand. Existing shared micromobility vehicle scheduling methods typically redistribute vehicles once or twice per day, which makes them vulnerable to performance degradation under atypical conditions. In this work, we design to augment existing micromobility scheduling methods by integrating a small number of autonomous shared micromobility vehicles (ASMVs), which possess self-rebalancing capabilities to dynamically adapt to real-time demand. Specifically, we introduce SMART, a hierarchical reinforcement learning framework that jointly optimizes high-level initial deployment and low-level real-time rebalancing for ASMVs. We evaluate our framework based on real-world e-scooter usage data from Chicago. Our experiment results show that our framework is highly effective and possesses strong generalization capability, allowing it to seamlessly integrate with existing vehicle scheduling methods and significantly enhance overall micromobility service performance.

[1075] Audit the Whisper: Detecting Steganographic Collusion in Multi-Agent LLMs

Om Tailor

Main category: cs.MA

TL;DR: Audit the Whisper is a comprehensive auditing framework for detecting covert coordination among LLM agents in multi-agent systems, combining theoretical guarantees, benchmark design, and reproducible infrastructure.

DetailsMotivation: Covert coordination among LLM agents in market, allocation, and governance workflows can silently erode trust and social welfare, while existing audits lack theoretical guarantees and reproducibility.

Method: The framework includes: channel-capacity analysis with interventions (paraphrase, rate limiting, role permutation), ColludeBench-v0 benchmark covering pricing/auctions/peer review, and a calibrated auditing pipeline using mutual information, permutation invariance, watermark variance, and fairness-aware acceptance bias.

Result: Across 600 audited runs spanning 12 intervention conditions, the union meta-test achieved true positive rate = 1 with zero false alarms, while ablations revealed trade-offs and fairness-driven colluders invisible to mutual information alone.

Conclusion: The framework provides complete reproducibility with regeneration scripts, seed-stamped manifests, and documentation, enabling external auditors to reproduce results and extend the framework with minimal effort.

Abstract: Multi-agent deployments of large language models (LLMs) are increasingly embedded in market, allocation, and governance workflows, yet covert coordination among agents can silently erode trust and social welfare. Existing audits are dominated by heuristics that lack theoretical guarantees, struggle to transfer across tasks, and seldom ship with the infrastructure needed for independent replication. We introduce \emph{Audit the Whisper}, a conference-grade research artifact that spans theory, benchmark design, detection, and reproducibility. Our contributions are: (i) a channel-capacity analysis showing how interventions such as paraphrase, rate limiting, and role permutation impose quantifiable capacity penalties – operationalized via paired-run Kullback–Leibler diagnostics – that tighten mutual-information thresholds with finite-sample guarantees; (ii) \textsc{ColludeBench}-v0, covering pricing, first-price auctions, and peer review with configurable covert schemes, deterministic manifests, and reward instrumentation; and (iii) a calibrated auditing pipeline that fuses cross-run mutual information, permutation invariance, watermark variance, and fairness-aware acceptance bias, each tuned to a (10^{-3}) false-positive budget. Across 600 audited runs spanning 12 intervention conditions, the union meta-test attains TPR~$=1$ with zero observed false alarms, while ablations surface the price-of-auditing trade-off and highlight fairness-driven colluders invisible to MI alone. We release regeneration scripts, seed-stamped manifests, and documentation so that external auditors can reproduce every figure and extend the framework with minimal effort.

[1076] NegotiationGym: Self-Optimizing Agents in a Multi-Agent Social Simulation Environment

Shashank Mangla, Chris Hokamp, Jack Boylan, Demian Gholipour Ghalandari, Yuuv Jauhari, Lauren Cassidy, Oisin Duffy

Main category: cs.MA

TL;DR: NegotiationGym is a user-friendly API and interface for creating and running multi-agent social simulations focused on negotiation and cooperation.

DetailsMotivation: To provide an accessible platform for designing and customizing negotiation-focused multi-agent simulations with easy configuration.

Method: Uses a configuration-driven API that allows scenario design, agent-level utility functions for optimization criteria, and enables agents to self-optimize through multiple interaction rounds and strategy modification.

Result: Successfully implemented NegotiationGym as a working system for multi-agent social simulations.

Conclusion: NegotiationGym provides an effective framework for creating and running customizable negotiation simulations where agents can learn and adapt their strategies through repeated interactions.

Abstract: We design and implement NegotiationGym, an API and user interface for configuring and running multi-agent social simulations focused upon negotiation and cooperation. The NegotiationGym codebase offers a user-friendly, configuration-driven API that enables easy design and customization of simulation scenarios. Agent-level utility functions encode optimization criteria for each agent, and agents can self-optimize by conducting multiple interaction rounds with other agents, observing outcomes, and modifying their strategies for future rounds.

[1077] Trade in Minutes! Rationality-Driven Agentic System for Quantitative Financial Trading

Zifan Song, Kaitao Song, Guosheng Hu, Ding Qi, Junyao Gao, Xiaohua Wang, Dongsheng Li, Cairong Zhao

Main category: cs.MA

TL;DR: TiMi is a rationality-driven multi-agent system that decouples strategy development from minute-level deployment in quantitative trading, using LLMs for semantic analysis, code programming, and mathematical reasoning.

DetailsMotivation: Current financial trading agents introduce emotional biases and rely on peripheral information while requiring continuous inference during deployment. The paper aims to harmonize strategic depth with mechanical rationality for quantitative trading.

Method: Two-tier analytical paradigm (macro patterns to micro customization), layered programming design for trading bot implementation, and closed-loop optimization driven by mathematical reflection using LLM capabilities.

Result: Extensive evaluations across 200+ trading pairs in stock and cryptocurrency markets show TiMi achieves stable profitability, action efficiency, and risk control under volatile market dynamics.

Conclusion: TiMi successfully demonstrates the efficacy of rationality-driven multi-agent systems in quantitative trading by architecturally separating strategy development from deployment and leveraging specialized LLM capabilities.

Abstract: Recent advancements in large language models (LLMs) and agentic systems have shown exceptional decision-making capabilities, revealing significant potential for autonomic finance. Current financial trading agents predominantly simulate anthropomorphic roles that inadvertently introduce emotional biases and rely on peripheral information, while being constrained by the necessity for continuous inference during deployment. In this paper, we pioneer the harmonization of strategic depth in agents with the mechanical rationality essential for quantitative trading. Consequently, we present TiMi (Trade in Minutes), a rationality-driven multi-agent system that architecturally decouples strategy development from minute-level deployment. TiMi leverages specialized LLM capabilities of semantic analysis, code programming, and mathematical reasoning within a comprehensive policy-optimization-deployment chain. Specifically, we propose a two-tier analytical paradigm from macro patterns to micro customization, layered programming design for trading bot implementation, and closed-loop optimization driven by mathematical reflection. Extensive evaluations across 200+ trading pairs in stock and cryptocurrency markets empirically validate the efficacy of TiMi in stable profitability, action efficiency, and risk control under volatile market dynamics.

[1078] The Hive Mind is a Single Reinforcement Learning Agent

Karthik Soma, Yann Bouteiller, Heiko Hamann, Giovanni Beltrame

Main category: cs.MA

TL;DR: The paper establishes an equivalence between collective decision-making via imitation and individual trial-and-error learning, showing that honey bee swarm behavior is equivalent to a single reinforcement learning agent.

DetailsMotivation: To understand how natural systems converge to optimal strategies through different mechanisms and bridge the gap between collective imitation and individual reinforcement learning.

Method: Analyzed the collective decision-making model of nest-hunting in honey bee swarms, showing that individual bees following local imitation rules create an emergent distributed cognition equivalent to a single online RL agent.

Result: Demonstrated that group-level imitation behavior is mathematically equivalent to Maynard-Cross Learning, a bandit algorithm, showing how simple individual behaviors can create complex collective intelligence.

Conclusion: Groups of cognition-limited organisms can be equivalent to more complex reinforcement-enabled entities, providing insights into evolution of imitation strategies and applications in swarm intelligence, economics, and social systems.

Abstract: Decision-making is an essential attribute of any intelligent agent or group. Natural systems are known to converge to optimal strategies through at least two distinct mechanisms: collective decision-making via imitation of others, and individual trial-and-error. This paper establishes an equivalence between these two paradigms by drawing from the well-established collective decision-making model of nest-hunting in swarms of honey bees. We show that the emergent distributed cognition (sometimes referred to as the $\textit{hive mind}$) arising from individual bees following simple, local imitation-based rules is that of a single online reinforcement learning (RL) agent interacting with many parallel environments. The update rule through which this macro-agent learns is a bandit algorithm that we coin $\textit{Maynard-Cross Learning}$. Our analysis implies that a group of cognition-limited organisms can be equivalent to a more complex, reinforcement-enabled entity, substantiating the idea that group-level intelligence may explain how seemingly simple and blind individual behaviors are selected in nature. From a biological perspective, this analysis suggests how such imitation strategies evolved: they constitute a scalable form of reinforcement learning at the group level, aligning with theories of kin and group selection. Beyond biology, the framework offers new tools for analyzing economic and social systems where individuals imitate successful strategies, effectively participating in a collective learning process. In swarm intelligence, our findings will inform the design of scalable collective systems in artificial domains, enabling RL-inspired mechanisms for coordination and adaptability at scale.

[1079] Learning Closed-Loop Parametric Nash Equilibria of Multi-Agent Collaborative Field Coverage

Jushan Chen, Santiago Paternain

Main category: cs.MA

TL;DR: This paper formulates multi-agent collaborative field coverage as a Markov Potential Game, enabling efficient learning of Nash Equilibrium through equivalent single-objective optimal control, achieving 10x faster training and faster convergence.

DetailsMotivation: Multi-agent reinforcement learning faces challenges with nonstationarity and agent coupling. Markov Potential Games offer a way to simplify these complex interactions by reducing them to single-objective problems.

Method: The authors prove that multi-agent collaborative field coverage can be formulated as a Markov Potential Game, allowing them to learn parameterized closed-loop Nash Equilibrium by solving an equivalent single-objective optimal control problem.

Result: The proposed algorithm achieves 10x faster training compared to game-theoretic baselines and demonstrates faster convergence during policy execution.

Conclusion: Formulating multi-agent collaborative problems as Markov Potential Games enables efficient learning through equivalent single-objective optimization, significantly improving training speed and convergence performance.

Abstract: Multi-agent reinforcement learning is a challenging and active field of research due to the inherent nonstationary property and coupling between agents. A popular approach to modeling the multi-agent interactions underlying the multi-agent RL problem is the Markov Game. There is a special type of Markov Game, termed Markov Potential Game, which allows us to reduce the Markov Game to a single-objective optimal control problem where the objective function is a potential function. In this work, we prove that a multi-agent collaborative field coverage problem, which is found in many engineering applications, can be formulated as a Markov Potential Game, and we can learn a parameterized closed-loop Nash Equilibrium by solving an equivalent single-objective optimal control problem. As a result, our algorithm is 10x faster during training compared to a game-theoretic baseline and converges faster during policy execution.

[1080] Who’s the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems

Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minghao Wang, Chi Liu

Main category: cs.MA

TL;DR: This paper studies intention-hiding threats in LLM-powered multi-agent systems, proposes four stealthy attack paradigms, and develops AgentXposed - a psychology-inspired detection framework that combines personality modeling and interrogation techniques to identify malicious agents.

DetailsMotivation: Existing research on LLM-based agents has primarily focused on single-agent scenarios, leaving the security of multi-agent systems largely unexplored despite their demonstrated capabilities in collaborative problem-solving.

Method: Designed four representative attack paradigms that subtly disrupt task completion while maintaining stealth, and proposed AgentXposed - a detection framework combining HEXACO personality modeling and Reid interrogation technique with progressive questionnaire probing and behavior-based monitoring.

Result: Experimental results show the proposed attacks are highly disruptive and evade existing defenses, while AgentXposed effectively detects diverse malicious behaviors across six datasets against both proposed attacks and baseline threats, achieving strong robustness across multiple communication settings.

Conclusion: The study addresses a critical security gap in LLM-MAS by systematically analyzing intention-hiding threats and providing an effective psychology-inspired detection solution that can proactively identify malicious agents before harmful actions occur.

Abstract: Multi-agent systems powered by Large Language Models (LLM-MAS) have demonstrated remarkable capabilities in collaborative problem-solving. However, their deployment also introduces new security risks. Existing research on LLM-based agents has primarily examined single-agent scenarios, while the security of multi-agent systems remains largely unexplored. To address this gap, we present a systematic study of intention-hiding threats in LLM-MAS. We design four representative attack paradigms that subtly disrupt task completion while maintaining a high degree of stealth, and evaluate them under centralized, decentralized, and layered communication structures. Experimental results show that these attacks are highly disruptive and can easily evade existing defense mechanisms. To counter these threats, we propose AgentXposed, a psychology-inspired detection framework. AgentXposed draws on the HEXACO personality model, which characterizes agents through psychological trait dimensions, and the Reid interrogation technique, a structured method for eliciting concealed intentions. By combining progressive questionnaire probing with behavior-based inter-agent monitoring, the framework enables the proactive identification of malicious agents before harmful actions are carried out. Extensive experiments across six datasets against both our proposed attacks and two baseline threats demonstrate that AgentXposed effectively detects diverse forms of malicious behavior, achieving strong robustness across multiple communication settings.

[1081] RobustFlow: Towards Robust Agentic Workflow Generation

Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, Min-Ling Zhang

Main category: cs.MA

TL;DR: The paper identifies brittleness in LLM-generated agentic workflows when faced with semantically identical but differently phrased instructions, proposes metrics to evaluate this instability, and introduces RobustFlow - a training framework that improves workflow robustness to 70-90% through preference optimization.

DetailsMotivation: Current agentic workflow generation methods produce inconsistent results for semantically identical instructions with different phrasing, which undermines reliability for real-world applications.

Method: Proposed nodal and topological similarity metrics to evaluate workflow consistency, and developed RobustFlow - a training framework using preference optimization to teach models invariance to instruction variations by training on sets of synonymous task descriptions.

Result: RobustFlow significantly improves workflow robustness scores to 70-90%, representing substantial improvement over existing approaches.

Conclusion: The proposed RobustFlow framework effectively addresses the critical challenge of workflow brittleness in LLM-generated agentic workflows, making them more reliable and trustworthy for practical applications.

Abstract: The automated generation of agentic workflows is a promising frontier for enabling large language models (LLMs) to solve complex tasks. However, our investigation reveals that the robustness of agentic workflow remains a critical, unaddressed challenge. Current methods often generate wildly inconsistent workflows when provided with instructions that are semantically identical but differently phrased. This brittleness severely undermines their reliability and trustworthiness for real-world applications. To quantitatively diagnose this instability, we propose metrics based on nodal and topological similarity to evaluate workflow consistency against common semantic variations such as paraphrasing and noise injection. Subsequently, we further propose a novel training framework, RobustFlow, that leverages preference optimization to teach models invariance to instruction variations. By training on sets of synonymous task descriptions, RobustFlow boosts workflow robustness scores to 70% - 90%, which is a substantial improvement over existing approaches. The code is publicly available at https://github.com/DEFENSE-SEU/RobustFlow.

cs.MM

[1082] FinCall-Surprise: A Large Scale Multi-modal Benchmark for Earning Surprise Prediction

Dong Shu, Yanguang Liu, Huopu Zhang, Mengnan Du

Main category: cs.MM

TL;DR: FinCall-Surprise is the first large-scale open-source multi-modal dataset for earnings surprise prediction, featuring conference call transcripts, audio recordings, and presentation slides. Evaluation of 26 LLMs reveals performance illusions from class imbalance, weaknesses in financial models, and limited effectiveness of multi-modal integration.

DetailsMotivation: To address the limitation of relying on expensive, proprietary, text-only data for earnings surprise prediction, which has constrained the development of advanced models in this profitable domain.

Method: Created FinCall-Surprise dataset with 2,688 corporate conference calls (2019-2021) containing transcripts, audio, and slides. Evaluated 26 state-of-the-art unimodal and multi-modal LLMs to establish comprehensive benchmarks.

Result: (1) High accuracy often illusory due to class imbalance; (2) Specialized financial models show unexpected weaknesses in instruction-following; (3) Multi-modal integration provides limited gains as models struggle to effectively leverage audio/visual signals.

Conclusion: Existing LLMs have critical limitations in financial reasoning capabilities, and FinCall-Surprise establishes a challenging new baseline for future research in earnings surprise prediction.

Abstract: Predicting corporate earnings surprises is a profitable yet challenging task, as accurate forecasts can inform significant investment decisions. However, progress in this domain has been constrained by a reliance on expensive, proprietary, and text-only data, limiting the development of advanced models. To address this gap, we introduce \textbf{FinCall-Surprise} (Financial Conference Call for Earning Surprise Prediction), the first large-scale, open-source, and multi-modal dataset for earnings surprise prediction. Comprising 2,688 unique corporate conference calls from 2019 to 2021, our dataset features word-to-word conference call textual transcripts, full audio recordings, and corresponding presentation slides. We establish a comprehensive benchmark by evaluating 26 state-of-the-art unimodal and multi-modal LLMs. Our findings reveal that (1) while many models achieve high accuracy, this performance is often an illusion caused by significant class imbalance in the real-world data. (2) Some specialized financial models demonstrate unexpected weaknesses in instruction-following and language generation. (3) Although incorporating audio and visual modalities provides some performance gains, current models still struggle to leverage these signals effectively. These results highlight critical limitations in the financial reasoning capabilities of existing LLMs and establish a challenging new baseline for future research.

[1083] Evaluating Keyframe Layouts for Visual Known-Item Search in Homogeneous Collections

Bastian Jäckl, Jiří Kruchina, Lucas Joos, Daniel A. Keim, Ladislav Peška, Jakub Lokoč

Main category: cs.MM

TL;DR: The paper evaluates seven keyframe layouts for video retrieval and finds that video-grouped layouts are most efficient while rank-preserving grids achieve highest accuracy, motivating hybrid designs.

DetailsMotivation: Current multimodal deep-learning models rank keyframes for video retrieval, but users still need to manually browse ranked candidates, and keyframe arrangement in search grids significantly affects browsing effectiveness and user efficiency, which remains underexplored.

Method: Conducted a study with 49 participants evaluating seven different keyframe layouts for Visual Known-Item Search task, analyzing efficiency, accuracy, and browsing phenomena like overlooks.

Result: Video-grouped layout was most efficient, four-column rank-preserving grid achieved highest accuracy. Sorted grids enable rapid scanning but down-rank relevant targets, delaying first arrival times and increasing overlooks.

Conclusion: Findings motivate hybrid designs that preserve positions of top-ranked items while sorting or grouping the remainder, offering guidance for grid-based search beyond video retrieval.

Abstract: Multimodal deep-learning models power interactive video retrieval by ranking keyframes in response to textual queries. Despite these advances, users must still browse ranked candidates manually to locate a target. Keyframe arrangement within the search grid highly affects browsing effectiveness and user efficiency, yet remains underexplored. We report a study with 49 participants evaluating seven keyframe layouts for the Visual Known-Item Search task. Beyond efficiency and accuracy, we relate browsing phenomena, such as overlooks, to layout characteristics. Our results show that a video-grouped layout is the most efficient, while a four-column, rank-preserving grid achieves the highest accuracy. Sorted grids reveal potentials and trade-offs, enabling rapid scanning of uninteresting regions but down-ranking relevant targets to less prominent positions, delaying first arrival times and increasing overlooks. These findings motivate hybrid designs that preserve positions of top-ranked items while sorting or grouping the remainder, and offer guidance for searching in grids beyond video retrieval.

[1084] Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data

Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li

Main category: cs.MM

TL;DR: A system for generating sentiment-controlled feedback from multimodal text and image inputs, using a transformer and Faster R-CNN for feature extraction, achieving 77.23% sentiment classification accuracy.

DetailsMotivation: To enable empathetic and engaging human-computer interaction by providing sentiment-controlled responses for applications in education, healthcare, marketing, and customer service.

Method: Built a controllable feedback synthesis system with encoder, decoder, and controllability blocks using transformer and Faster R-CNN networks to extract and combine textual and visual features.

Result: Achieved 77.23% sentiment classification accuracy, which is 18.82% higher than without controllability, using the CMFeed dataset containing images, texts, reactions, and human comments.

Conclusion: The proposed system successfully generates sentiment-controlled multimodal feedback with significantly improved accuracy, and the CMFeed dataset and code are publicly available.

Abstract: The ability to generate sentiment-controlled feedback in response to multimodal inputs comprising text and images addresses a critical gap in human-computer interaction. This capability allows systems to provide empathetic, accurate, and engaging responses, with useful applications in education, healthcare, marketing, and customer service. To this end, we have constructed a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and proposed a controllable feedback synthesis system. The system features an encoder, decoder, and controllability block for textual and visual inputs. It extracts features using a transformer and a Faster R-CNN network, combining them to generate feedback. The CMFeed dataset includes images, texts, reactions to the posts, human comments with relevance scores, and reactions to these comments. These reactions train the model to produce feedback with specified sentiments, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than the accuracy without controllability. Access to the CMFeed dataset and the system’s code is available at https://github.com/MIntelligence-Group/CMFeed.

[1085] Cap2Sum: Learning to Summarize Videos by Generating Captions

Cairong Zhao, Chutian Wang, Zifan Song, Guosheng Hu, Haonan Chen, Xiaofan Zhai

Main category: cs.MM

TL;DR: Cap2Sum is a weakly-supervised video summarization model that uses dense video captions as supervision, achieving better performance and generalization through CLIP Prior mechanism and training on large-scale datasets.

DetailsMotivation: Video summarization faces limited performance due to small-scale datasets from high labeling costs. Using dense video captions as supervision enables training on larger datasets for better generalization.

Method: Proposes Cap2Sum model that learns video summarization by generating captions, uses CLIP Prior to enhance important object learning, and can perform zero-shot summarization or fine-tuning with ground-truth summaries/captions.

Result: Significant improvements in performance and generalization capacity compared to previous methods, demonstrated through extensive experiments on new datasets TVSum-Caption and SumMe-Caption.

Conclusion: Using dense video captions as weak supervision enables effective video summarization training on large-scale datasets, overcoming limitations of small labeled datasets while achieving superior performance and generalization.

Abstract: With the rapid growth of video data on the internet, video summarization is becoming a very important AI technology. However, due to the high labelling cost of video summarization, existing studies have to be conducted on small-scale datasets, leading to limited performance and generalization capacity. In this work, we introduce the use of dense video captions as a supervision signal to train video summarization models. Motivated by this, we propose Cap2Sum, a model that learns to summarize videos by generating captions, to exploit dense video caption annotations. This weakly-supervised approach allows us to train the models on large-scale dense video caption datasets to achieve better performance and generalization capacity. To further improve the generalization capacity, we introduce a CLIP (a strong vision-language model) Prior mechanism to enhance the learning of important objects that captions may ignore in the videos. In practice, Cap2Sum can perform zero-shot video summarization or be fine-tuned by the ground-truth summary or video caption of the target dataset. To examine the performance of Cap2Sum after weakly-supervised fine-tuning by the video captions, we propose two new datasets, TVSum-Caption and SumMe-Caption, which are derived from two common video summarization datasets and will be publicly released. We conduct extensive experiments and the results demonstrate that our method achieves significant improvements in performance and generalization capacity compared with previous methods.

[1086] Fact-Checking at Scale: Multimodal AI for Authenticity and Context Verification in Online Media

Van-Hoang Phan, Tung-Duong Le-Duc, Long-Khanh Pham, Anh-Thu Le, Quynh-Huong Dinh-Nguyen, Dang-Quan Vo, Hoang-Quoc Nguyen-Son, Anh-Duy Tran, Dang Vu, Minh-Son Dao

Main category: cs.MM

TL;DR: A comprehensive multimedia verification system for detecting misinformation in multilingual settings, integrating visual forensics, textual analysis, and multimodal reasoning with hybrid OOC detection methods.

DetailsMotivation: The proliferation of multimedia content on social media enables rapid spread of misinformation during crises, intensified by synthetic media and out-of-context content reuse, creating urgent need for robust verification tools.

Method: Unified verification pipeline integrating visual forensics, textual analysis, and multimodal reasoning; hybrid approach for out-of-context media detection using semantic similarity, temporal alignment, and geolocation cues.

Result: Extensive evaluations on ACM Multimedia 2025 Grand Challenge benchmark demonstrate system effectiveness across diverse real-world scenarios.

Conclusion: The system advances state of the art in multimedia verification and provides practical tools for journalists, fact-checkers, and researchers to address information integrity challenges.

Abstract: The proliferation of multimedia content on social media platforms has dramatically transformed how information is consumed and disseminated. While this shift enables real-time coverage of global events, it also facilitates the rapid spread of misinformation and disinformation, especially during crises such as wars, natural disasters, or elections. The rise of synthetic media and the reuse of authentic content in misleading contexts have intensified the need for robust multimedia verification tools. In this paper, we present a comprehensive system developed for the ACM Multimedia 2025 Grand Challenge on Multimedia Verification. Our system assesses the authenticity and contextual accuracy of multimedia content in multilingual settings and generates both expert-oriented verification reports and accessible summaries for the general public. We introduce a unified verification pipeline that integrates visual forensics, textual analysis, and multimodal reasoning, and propose a hybrid approach to detect out-of-context (OOC) media through semantic similarity, temporal alignment, and geolocation cues. Extensive evaluations on the Grand Challenge benchmark demonstrate the system’s effectiveness across diverse real-world scenarios. Our contributions advance the state of the art in multimedia verification and offer practical tools for journalists, fact-checkers, and researchers confronting information integrity challenges in the digital age.

[1087] Comparing Contrastive and Triplet Loss: Variance Analysis and Optimization Behavior

Donghuo Zeng

Main category: cs.MM

TL;DR: Theoretical and empirical comparison shows triplet loss preserves greater intra- and inter-class variance for finer-grained distinctions, while contrastive loss compacts intra-class embeddings and may obscure subtle semantic differences.

DetailsMotivation: To understand the effects of contrastive loss and triplet loss on representation quality in deep metric learning, as their impacts remain insufficiently understood.

Method: Theoretical analysis and empirical experiments on synthetic data and real datasets (MNIST, CIFAR-10) examining intra-/inter-class variance, optimization behavior (loss-decay rate, active ratio, gradient norm), and performance on classification and retrieval tasks across multiple datasets.

Result: Triplet loss preserves greater variance within and across classes, produces fewer but stronger updates that sustain learning on hard examples, and consistently yields superior performance across classification and retrieval tasks on MNIST, CIFAR-10, CUB-200, and CARS196 datasets.

Conclusion: Triplet loss is recommended for detail retention and hard-sample focus, while contrastive loss is better for smoother, broad-based embedding refinement.

Abstract: Contrastive loss and triplet loss are widely used objectives in deep metric learning, yet their effects on representation quality remain insufficiently understood. We present a theoretical and empirical comparison of these losses, focusing on intra- and inter-class variance and optimization behavior (e.g., greedy updates). Through task-specific experiments with consistent settings on synthetic data and real datasets-MNIST, CIFAR-10-it is shown that triplet loss preserves greater variance within and across classes, supporting finer-grained distinctions in the learned representations. In contrast, contrastive loss tends to compact intra-class embeddings, which may obscure subtle semantic differences. To better understand their optimization dynamics, By examining loss-decay rate, active ratio, and gradient norm, we find that contrastive loss drives many small updates early on, while triplet loss produces fewer but stronger updates that sustain learning on hard examples. Finally, across both classification and retrieval tasks on MNIST, CIFAR-10, CUB-200, and CARS196 datasets, our results consistently show that triplet loss yields superior performance, which suggests using triplet loss for detail retention and hard-sample focus, and contrastive loss for smoother, broad-based embedding refinement.

eess.AS

[1088] Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams

Xiluo He, Alexander Polok, Jesús Villalba, Thomas Thebaud, Matthew Maciejewski

Main category: eess.AS

TL;DR: Proposes a method to decouple multi-talker ASR inference cost from speaker count by converting speaker-specific activity into speaker-agnostic streams, maintaining performance while reducing runtime.

DetailsMotivation: Current multi-talker ASR systems require running ASR model once per speaker, making inference costs scale with speaker count and limiting practicality.

Method: Convert speaker-specific activity outputs into two speaker-agnostic streams using heuristics that preserve conversational continuity and maintain compatibility with existing ASR systems.

Result: Compatible with Diarization-Conditioned Whisper (DiCoW), greatly reduces runtimes on AMI and ICSI meeting datasets while retaining competitive performance.

Conclusion: The approach successfully decouples inference cost from speaker count, making multi-talker ASR more practical while maintaining recognition quality.

Abstract: An increasingly common training paradigm for multi-talker automatic speech recognition (ASR) is to use speaker activity signals to adapt single-speaker ASR models for overlapping speech. Although effective, these systems require running the ASR model once per speaker, resulting in inference costs that scale with the number of speakers and limiting their practicality. In this work, we propose a method that decouples the inference cost of activity-conditioned ASR systems from the number of speakers by converting speaker-specific activity outputs into two speaker-agnostic streams. A central challenge is that na"ively merging speaker activities into streams significantly degrades recognition, since pretrained ASR models assume contiguous, single-speaker inputs. To address this, we design new heuristics aimed at preserving conversational continuity and maintaining compatibility with existing systems. We show that our approach is compatible with Diarization-Conditioned Whisper (DiCoW) to greatly reduce runtimes on the AMI and ICSI meeting datasets while retaining competitive performance.

[1089] Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition

Martin Kocour, Martin Karafiat, Alexander Polok, Dominik Klement, Lukáš Burget, Jan Černocký

Main category: eess.AS

TL;DR: A speaker-attributed Whisper model that combines target-speaker modeling with serialized output training for multi-talker speech recognition, achieving better performance than existing approaches.

DetailsMotivation: To improve multi-talker speech recognition by enabling joint decoding that considers all speakers' context simultaneously, rather than decoding each speaker separately.

Method: Uses Diarization-Conditioned Whisper (DiCoW) encoder to extract target-speaker embeddings, concatenates them into a single representation, and passes to a shared decoder for serialized output with speaker tags and timestamps.

Result: Outperforms existing SOT-based approaches and surpasses DiCoW on multi-talker mixtures like LibriMix.

Conclusion: The proposed speaker-attributed Whisper model with joint decoding effectively handles overlapping speech and provides better transcription accuracy for multi-talker scenarios.

Abstract: We propose a speaker-attributed (SA) Whisper-based model for multi-talker speech recognition that combines target-speaker modeling with serialized output training (SOT). Our approach leverages a Diarization-Conditioned Whisper (DiCoW) encoder to extract target-speaker embeddings, which are concatenated into a single representation and passed to a shared decoder. This enables the model to transcribe overlapping speech as a serialized output stream with speaker tags and timestamps. In contrast to target-speaker ASR systems such as DiCoW, which decode each speaker separately, our approach performs joint decoding, allowing the decoder to condition on the context of all speakers simultaneously. Experiments show that the model outperforms existing SOT-based approaches and surpasses DiCoW on multi-talker mixtures (e.g., LibriMix).

[1090] A MATLAB toolbox for Computation of Speech Transmission Index (STI)

Pavel Rajmic, Jiří Schimmel, Šimon Cieslar

Main category: eess.AS

TL;DR: Open source Matlab implementation of Speech Transmission Index (STI) following IEC 60268-16:2020 standard, including both direct/indirect methods and STIPA protocol.

DetailsMotivation: Reliable STI implementations are not publicly accessible and often limited to proprietary hardware, creating barriers for speech intelligibility research.

Method: Developed Matlab implementation of STI computation following IEC standard, including direct/indirect approaches and shortened STIPA protocol, with verification against reference signals and commercial devices.

Result: Implementation meets prescribed requirements, verified through tests on reference signals and comparison with commercial measurement device.

Conclusion: Provides open source STI computation tool that overcomes limitations of proprietary implementations, enabling broader access to standardized speech intelligibility assessment.

Abstract: The speech transmission index (STI) is a popular simple metric for the prediction of speech intelligibility when speech is passed through a transmission channel. Computation of STI from acoustic measurements is described in the IEC 60268-16:2020 standard. Though, reliable implementations of STI are not publicly accessible and are frequently limited to the use with a proprietary measurement hardware. We present a Matlab STI implementation of both the direct and indirect approaches according to the standard, including the shortened STIPA protocol. The suggested implementation meets prescribed requirements, as evidenced by tests on reference signals. Additionally, we conducted a verification measurement in comparison to a commercial measurement device. Our software comes with open source code.

[1091] A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation

Ananya Raghu, Anisha Raghu, Nithika Vivek, Sofie Budman, Omar Mansour

Main category: eess.AS

TL;DR: A unified AI framework for dysarthria speech analysis with 97% accuracy in detection/classification, cross-lingual transfer learning, and speech reconstruction capabilities across English, Russian, and German.

DetailsMotivation: Dysarthria causes communication barriers but current tools lack generalizability across languages and severity levels, requiring a comprehensive multilingual solution.

Method: Used spectrogram-based visualizations and acoustic feature extraction to train models on English, Russian, and German datasets for six components: detection, classification, speech generation, speech-to-text, emotion detection, and voice cloning.

Result: Achieved 97% accuracy in binary detection and severity classification across all languages. Speech reconstruction had low L1 losses (0.02-0.06), and speech-to-text achieved 0.1367 WER. Cross-lingual transfer learning improved performance in low-resource settings.

Conclusion: The framework successfully addresses dysarthria diagnosis and communication improvement across multiple languages, demonstrating strong generalization and cross-lingual transfer capabilities.

Abstract: Dysarthria is a motor speech disorder that results in slow and often incomprehensible speech. Speech intelligibility significantly impacts communication, leading to barriers in social interactions. Dysarthria is often a characteristic of neurological diseases including Parkinson’s and ALS, yet current tools lack generalizability across languages and levels of severity. In this study, we present a unified AI-based multilingual framework that addresses six key components: (1) binary dysarthria detection, (2) severity classification, (3) clean speech generation, (4) speech-to-text conversion, (5) emotion detection, and (6) voice cloning. We analyze datasets in English, Russian, and German, using spectrogram-based visualizations and acoustic feature extraction to inform model training. Our binary detection model achieved 97% accuracy across all three languages, demonstrating strong generalization across languages. The severity classification model also reached 97% test accuracy, with interpretable results showing model attention focused on lower harmonics. Our translation pipeline, trained on paired Russian dysarthric and clean speech, reconstructed intelligible outputs with low training (0.03) and test (0.06) L1 losses. Given the limited availability of English dysarthric-clean pairs, we fine-tuned the Russian model on English data and achieved improved losses of 0.02 (train) and 0.03 (test), highlighting the promise of cross-lingual transfer learning for low-resource settings. Our speech-to-text pipeline achieved a Word Error Rate of 0.1367 after three epochs, indicating accurate transcription on dysarthric speech and enabling downstream emotion recognition and voice cloning from transcribed speech. Overall, the results and products of this study can be used to diagnose dysarthria and improve communication and understanding for patients across different languages.

[1092] MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic

Main category: eess.AS

TL;DR: MoME integrates sparse Mixture-of-Experts into Matryoshka representation learning for audio-visual speech recognition, enabling dynamic capacity allocation across token granularities and achieving state-of-the-art performance with fewer parameters.

DetailsMotivation: Address limitations of current LLM-based AVSR approaches: high computational demands, fixed token compression rates, and independent scale training in MRL methods that limit cross-scale generalization and robustness.

Method: Propose MoME framework that augments frozen LLM with top-k routed and shared experts, using shared router for consistent expert activation across granularities to allow compressed sequences to benefit from lower compression representations.

Result: Achieves state-of-the-art performance on LRS2 and LRS3 datasets across AVSR, ASR, and VSR tasks, requiring significantly fewer parameters while maintaining robustness under noise conditions.

Conclusion: MoME successfully unifies the adaptability of MRL with the efficiency of MoE, providing a scalable and interpretable solution for resource-aware speech recognition.

Abstract: Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.

[1093] Drax: Speech Recognition with Discrete Flow Matching

Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya

Main category: eess.AS

TL;DR: Drax is a discrete flow matching framework for ASR that enables efficient parallel decoding by constructing audio-conditioned probability paths that guide through likely intermediate inference errors rather than direct noise-to-target transitions.

DetailsMotivation: Diffusion and flow-based non-autoregressive models show promise in language modeling but remain largely unexplored for ASR. The paper aims to bridge the gap between training and inference in ASR models.

Method: Proposes Drax framework using discrete flow matching with audio-conditioned probability paths that simulate likely intermediate inference errors. Theoretical analysis links generalization gap to cumulative velocity errors.

Result: Achieves recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs.

Conclusion: Discrete flow matching is a promising direction for advancing non-autoregressive ASR systems.

Abstract: Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling, however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random noise to target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.

[1094] Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation guided Structured Pruning

Ze Li, Ming Cheng, Ming Li

Main category: eess.AS

TL;DR: The paper uses w2v-BERT 2.0, a large self-supervised pre-trained model, for speaker verification, achieving state-of-the-art results with efficient fine-tuning and model compression.

DetailsMotivation: To leverage large-scale self-supervised pre-trained models for speaker verification tasks, as they provide rich feature representations that can improve performance.

Method: Utilizes w2v-BERT 2.0 with MFA structure and Layer Adapter for processing multi-layer features, incorporates LoRA for efficient fine-tuning, and applies knowledge distillation guided structured pruning for model compression.

Result: Achieves 0.12% and 0.55% EER on Vox1-O and Vox1-H test sets respectively, and reduces model size by 80% with only 0.04% EER degradation through pruning.

Conclusion: The approach demonstrates that large pre-trained models can achieve state-of-the-art speaker verification performance while being efficiently compressed through knowledge distillation and pruning techniques.

Abstract: Large-scale self-supervised Pre-Trained Models (PTMs) have shown significant improvements in the speaker verification (SV) task by providing rich feature representations. In this paper, we utilize w2v-BERT 2.0, a model with approximately 600 million parameters trained on 450 million hours of unlabeled data across 143 languages, for the SV task. The MFA structure with Layer Adapter is employed to process the multi-layer feature outputs from the PTM and extract speaker embeddings. Additionally, we incorporate LoRA for efficient fine-tuning. Our model achieves state-of-the-art results with 0.12% and 0.55% EER on the Vox1-O and Vox1-H test sets, respectively. Furthermore, we apply knowledge distillation guided structured pruning, reducing the model size by 80% while achieving only a 0.04% EER degradation. Source code and models are released at https://github.com/ZXHY-82/w2v-BERT-2.0_SV.

[1095] Probing Whisper for Dysarthric Speech in Detection and Assessment

Zhengjun Yue, Devendra Kayande, Zoran Cvetkovic, Erfan Loweimi

Main category: eess.AS

TL;DR: Probing Whisper-Medium model encoder layers reveals mid-level layers (13-15) are most informative for dysarthric speech detection and severity classification, with fine-tuning showing minimal impact.

DetailsMotivation: Understanding how large-scale speech models like Whisper represent pathological speech is crucial for developing reliable clinical assessment tools, but their internal behavior on dysarthric speech remains poorly understood.

Method: Used linear classifiers with layer-wise embeddings under single-task and multi-task settings, complemented with Silhouette scores and mutual information analysis. Also examined adaptability by fine-tuning Whisper on dysarthric speech recognition.

Result: Mid-level encoder layers (13-15) consistently emerged as most informative across all metrics. Fine-tuning induced only modest changes to the layer-wise patterns.

Conclusion: The findings improve interpretability of Whisper’s embeddings and demonstrate the value of probing analyses for guiding the use of large-scale pretrained models in pathological speech applications.

Abstract: Large-scale end-to-end models such as Whisper have shown strong performance on diverse speech tasks, but their internal behavior on pathological speech remains poorly understood. Understanding how dysarthric speech is represented across layers is critical for building reliable and explainable clinical assessment tools. This study probes the Whisper-Medium model encoder for dysarthric speech for detection and assessment (i.e., severity classification). We evaluate layer-wise embeddings with a linear classifier under both single-task and multi-task settings, and complement these results with Silhouette scores and mutual information to provide perspectives on layer informativeness. To examine adaptability, we repeat the analysis after fine-tuning Whisper on a dysarthric speech recognition task. Across metrics, the mid-level encoder layers (13-15) emerge as most informative, while fine-tuning induces only modest changes. The findings improve the interpretability of Whisper’s embeddings and highlight the potential of probing analyses to guide the use of large-scale pretrained models for pathological speech.

[1096] Differentiable physics for sound field reconstruction

Samuel A. Verburg, Efren Fernandez-Grande, Peter Gerstoft

Main category: eess.AS

TL;DR: A differentiable physics approach for sound field reconstruction using neural networks to approximate wave equation initial conditions, with physics enforced as strong constraints through numerical solvers and sparsity-promoting regularization.

DetailsMotivation: To address sound field reconstruction from limited observations by overcoming limitations of conventional physics-informed neural networks that treat physics as soft constraints in loss functions, which can lead to unstable training and poor performance under severe undersampling.

Method: Approximates initial conditions of wave equation with neural network, computes differential operator with differentiable numerical solver (enforcing physics as strong constraint), and adds sparsity-promoting constraint for severe undersampling scenarios.

Result: Achieves higher accuracy and better convergence compared to physics-informed neural networks, successfully reconstructs sound fields under extreme data scarcity conditions.

Conclusion: Differentiable physics approach with strong physics constraints and sparsity regularization enables stable training and effective sound field reconstruction even with severely limited observations.

Abstract: Sound field reconstruction involves estimating sound fields from a limited number of spatially distributed observations. This work introduces a differentiable physics approach for sound field reconstruction, where the initial conditions of the wave equation are approximated with a neural network, and the differential operator is computed with a differentiable numerical solver. The use of a numerical solver enables a stable network training while enforcing the physics as a strong constraint, in contrast to conventional physics-informed neural networks, which include the physics as a constraint in the loss function. We introduce an additional sparsity-promoting constraint to achieve meaningful solutions even under severe undersampling conditions. Experiments demonstrate that the proposed approach can reconstruct sound fields under extreme data scarcity, achieving higher accuracy and better convergence compared to physics-informed neural networks.

[1097] UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen

Main category: eess.AS

TL;DR: UniVoice is a unified LLM framework that integrates speech recognition and synthesis using continuous representations, achieving state-of-the-art performance in both tasks through a dual attention mechanism and flow matching.

DetailsMotivation: Current approaches handle ASR and TTS separately, lacking a unified framework. Discrete speech tokenization causes information loss, limiting performance in both recognition and generation tasks.

Method: Combines autoregressive modeling for speech recognition with flow matching for generation, using a dual attention mechanism that switches between causal masks (recognition) and bidirectional masks (synthesis). Includes text-prefix-conditioned speech infilling for zero-shot voice cloning.

Result: Achieves or exceeds current single-task modeling methods in both ASR and zero-shot TTS tasks, demonstrating high-fidelity voice cloning capabilities.

Conclusion: UniVoice explores new possibilities for end-to-end speech understanding and generation through unified modeling of recognition and synthesis tasks.

Abstract: Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation.

[1098] AURA Score: A Metric For Holistic Audio Question Answering Evaluation

Satvik Dixit, Soham Deshmukh, Bhiksha Raj

Main category: eess.AS

TL;DR: This paper introduces AQEval, the first benchmark for Audio Question Answering (AQA) metrics, and proposes AURA score, a new metric that significantly outperforms existing methods in correlating with human judgments.

DetailsMotivation: Existing AQA metrics (BLEU, METEOR, BERTScore) adapted from NLP and audio captioning fail to account for question context, reasoning, and partial correctness, creating a gap in evaluating open-ended responses.

Method: Created AQEval benchmark with 10k model responses annotated by humans, conducted comprehensive analysis of existing metrics, and developed AURA score as a new evaluation metric.

Result: AURA achieves state-of-the-art correlation with human ratings on AQEval, significantly outperforming all baseline metrics, especially for longer answers.

Conclusion: The work highlights limitations of current AQA evaluation methods and provides both AQEval benchmark and AURA metric to support future research in holistic AQA evaluation.

Abstract: Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA such as BLEU, METEOR and BERTScore, mostly adapted from NLP and audio captioning, rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address the gap in literature, we make three contributions in this work. First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for their correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting weak correlation with human judgment, especially for longer answers. Third, we propose a new metric - AURA score, to better evaluate open-ended model responses. On AQEval, AURA achieves state-of-the-art correlation with human ratings, significantly outperforming all baselines. Through this work, we aim to highlight the limitations of current AQA evaluation methods and motivate better metrics. We release both the AQEval benchmark and the AURA metric to support future research in holistic AQA evaluation.

[1099] Perceptual Evaluation of Extrapolated Spatial Room Impulse Responses From a Mono Source

Ben Heritage, Fiona Ryder, Michael McLoughlin, Karolina Prawda

Main category: eess.AS

TL;DR: This paper evaluates the plausibility of extrapolated spatial Room Impulse Responses (RIRs) using listening tests, showing that artificial RIRs can be perceived as real with only 38% detection accuracy.

DetailsMotivation: Current methods for creating plausible spatial audio in VR/AR require many acoustic measurements with specialized equipment, making the process time-consuming and expensive.

Method: Used 3-Alternative Forced Choice (3AFC) listening tests with RIRs from three spaces convolved with speech, orchestral, and instrumental music. Participants had to identify which of three stimuli (one extrapolated and two real) was artificial.

Result: Overall detection accuracy was 38% (only 5 percentage points above the expected guessing rate of 33%), indicating that extrapolated RIRs were often perceived as real.

Conclusion: It is possible to extrapolate plausible spatial RIRs from mono measurements, reducing the need for time-consuming acoustic measurements with specialized equipment.

Abstract: Immersion in virtual and augmented reality solutions is reliant on plausible spatial audio. However, plausibly representing a space for immersive audio often requires many individual acoustic measurements of source-microphone pairs with specialist spatial microphones, making the procedure time-consuming and expensive. In this study, we evaluate the plausibility of extrapolated and spatialised Room Impulse Responses (RIRs) by using a 3-Alternative Forced Choice (3AFC) listening test. The stimuli comprised of RIRs from three spaces convolved with speech, orchestral, and instrumental music. When asked to select which stimuli was artificial out of one extrapolated and two real stimuli, an overall accuracy of 38% was achieved from 20 participants (5 percentage points above the expected guessing rate). Given the listening test result, this study shows that it is possible to extrapolate plausible spatial RIRs from mono measurements, decreasing the need for time and specialist equipment in acoustic measurements.

[1100] MuFFIN: Multifaceted Pronunciation Feedback Model with Interactive Hierarchical Neural Modeling

Bi-Cheng Yan, Ming-Kang Tsai, Berlin Chen

Main category: eess.AS

TL;DR: MuFFIN is a multi-faceted pronunciation feedback model that jointly addresses mispronunciation detection and diagnosis (MDD) and automatic pronunciation assessment (APA) through an interactive hierarchical neural architecture with phoneme-contrastive ordinal regularization and imbalance-aware training.

DetailsMotivation: Existing CAPT methods treat MDD and APA as independent tasks despite their natural complementarity, leading to disparate modeling paradigms that don't leverage their synergistic potential.

Method: Proposed MuFFIN with interactive hierarchical neural architecture, phoneme-contrastive ordinal regularization for discriminative features, and imbalance-aware training objective using phoneme-specific variations to handle data imbalance in MDD.

Result: State-of-the-art performance on both APA and MDD tasks on the Speechocean762 benchmark dataset, outperforming several cutting-edge baselines.

Conclusion: The joint modeling approach effectively addresses both pronunciation assessment and error diagnosis, demonstrating the efficacy of integrating MDD and APA tasks through the proposed multi-faceted framework.

Abstract: Computer-assisted pronunciation training (CAPT) manages to facilitate second-language (L2) learners to practice pronunciation skills by offering timely and instructive feedback. To examine pronunciation proficiency from multiple facets, existing methods for CAPT broadly fall into two categories: mispronunciation detection and diagnosis (MDD) as well as automatic pronunciation assessment (APA). The former aims to pinpoint phonetic pronunciation errors and provide diagnostic feedback, while the latter seeks instead to quantify pronunciation proficiency pertaining to various aspects. Despite the natural complementarity between MDD and APA, researchers and practitioners, however, often treat them as independent tasks with disparate modeling paradigms. In light of this, we in this paper first introduce MuFFIN, a Multi-Faceted pronunciation Feedback model with an Interactive hierarchical Neural architecture, to jointly address the tasks of MDD and APA. To better capture the nuanced distinctions between phonemes in the feature space, a novel phoneme-contrastive ordinal regularization mechanism is then put forward to optimize the proposed model to generate more phoneme-discriminative features while factoring in the ordinality of the aspect scores. In addition, to address the intricate data imbalance problem in MDD, we design a simple yet effective training objective, which is specifically tailored to perturb the outputs of a phoneme classifier with the phoneme-specific variations, so as to better render the distribution of predicted phonemes meanwhile considering their mispronunciation characteristics. A series of experiments conducted on the Speechocean762 benchmark dataset demonstrates the efficacy of our method in relation to several cutting-edge baselines, showing state-of-the-art performance on both the APA and MDD tasks.

[1101] Leveraging Self-Supervised Audio-Visual Pretrained Models to Improve Vocoded Speech Intelligibility in Cochlear Implant Simulation

Richard Lee Lai, Jen-Cheng Hou, I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, Mandar Gogate, Tughrul Arslan, Amir Hussain, Yu Tsao

Main category: eess.AS

TL;DR: Proposes SSL-AVSE, a self-supervised learning framework for audio-visual speech enhancement that addresses limited training data by combining visual cues with audio signals using AV-HuBERT and BLSTM models, showing significant improvements in speech quality and intelligibility for cochlear implant users.

DetailsMotivation: Address challenges faced by hearing-impaired individuals in comprehending speech in noisy environments, particularly focusing on limited training data scenarios for audio-visual speech enhancement in cochlear implant simulations.

Method: Proposes SSL-AVSE framework that combines visual cues (lip/mouth movements) with audio signals, uses Transformer-based SSL AV-HuBERT model for feature extraction, and processes with BLSTM-based speech enhancement model. Fine-tunes AV-HuBERT parameters for target task.

Result: Significant performance improvements: PESQ increased from 1.43 to 1.67, STOI from 0.70 to 0.74. For CI vocoded speech with dynamic noises, NCM values improved by 26.5% to 87.2% compared to noisy baseline.

Conclusion: SSL-AVSE successfully overcomes limited data issues and substantially enhances speech intelligibility for cochlear implant users, particularly effective in noisy conversational environments.

Abstract: Individuals with hearing impairments face challenges in their ability to comprehend speech, particularly in noisy environments. The aim of this study is to explore the effectiveness of audio-visual speech enhancement (AVSE) in enhancing the intelligibility of vocoded speech in cochlear implant (CI) simulations. Notably, the study focuses on a challenged scenario where there is limited availability of training data for the AVSE task. To address this problem, we propose a novel deep neural network framework termed Self-Supervised Learning-based AVSE (SSL-AVSE). The proposed SSL-AVSE combines visual cues, such as lip and mouth movements, from the target speakers with corresponding audio signals. The contextually combined audio and visual data are then fed into a Transformer-based SSL AV-HuBERT model to extract features, which are further processed using a BLSTM-based SE model. The results demonstrate several key findings. Firstly, SSL-AVSE successfully overcomes the issue of limited data by leveraging the AV-HuBERT model. Secondly, by fine-tuning the AV-HuBERT model parameters for the target SE task, significant performance improvements are achieved. Specifically, there is a notable enhancement in PESQ (Perceptual Evaluation of Speech Quality) from 1.43 to 1.67 and in STOI (Short-Time Objective Intelligibility) from 0.70 to 0.74. Furthermore, the performance of the SSL-AVSE was evaluated using CI vocoded speech to assess the intelligibility for CI users. Comparative experimental outcomes reveal that in the presence of dynamic noises encountered during human conversations, SSL-AVSE exhibits a substantial improvement. The NCM (Normal Correlation Matrix) values indicate an increase of 26.5% to 87.2% compared to the noisy baseline.

[1102] Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System

Ze Li, Yao Shi, Yunfei Xu, Ming Li

Main category: eess.AS

TL;DR: This paper investigates defense strategies against adversarial attacks on speaker embedding based zero-shot TTS systems, focusing on adversarial training and adversarial purification to enhance security.

DetailsMotivation: Speaker embedding based zero-shot TTS systems are vulnerable to adversarial attacks that can manipulate synthesized speech to sound like another person, posing security risks including speaker identity spoofing and unauthorized voice manipulation.

Method: The paper explores two defense strategies: 1) adversarial training that integrates adversarial examples during training to improve model robustness, and 2) adversarial purification using diffusion probabilistic models to revert adversarially perturbed audio back to its clean form.

Result: Experimental results show that both defense mechanisms significantly reduce the impact of adversarial perturbations, improving the security of zero-shot TTS systems against adversarial attacks.

Conclusion: The proposed adversarial training and purification methods effectively enhance the security and reliability of speaker embedding based zero-shot TTS systems in adversarial environments.

Abstract: Speaker embedding based zero-shot Text-to-Speech (TTS) systems enable high-quality speech synthesis for unseen speakers using minimal data. However, these systems are vulnerable to adversarial attacks, where an attacker introduces imperceptible perturbations to the original speaker’s audio waveform, leading to synthesized speech sounds like another person. This vulnerability poses significant security risks, including speaker identity spoofing and unauthorized voice manipulation. This paper investigates two primary defense strategies to address these threats: adversarial training and adversarial purification. Adversarial training enhances the model’s robustness by integrating adversarial examples during the training process, thereby improving resistance to such attacks. Adversarial purification, on the other hand, employs diffusion probabilistic models to revert adversarially perturbed audio to its clean form. Experimental results demonstrate that these defense mechanisms can significantly reduce the impact of adversarial perturbations, enhancing the security and reliability of speaker embedding based zero-shot TTS systems in adversarial environments.

[1103] From Dialect Gaps to Identity Maps: Tackling Variability in Speaker Verification

Abdulhady Abas Abdullah, Soran Badawi, Dana A. Abdullah, Dana Rasul Hamad

Main category: eess.AS

TL;DR: This paper investigates the challenges of Kurdish speaker detection across multiple dialects (Kurmanji, Sorani, Hawrami) and proposes solutions using machine learning, data augmentation, and dialect-specific corpora to improve recognition accuracy.

DetailsMotivation: Kurdish language with its multiple dialects presents unique challenges for speaker recognition systems due to significant phonetic and lexical differences between dialects, making accurate speaker identification difficult across different dialect groups.

Method: The study uses sophisticated machine learning approaches, data augmentation strategies, and builds comprehensive dialect-specific corpora. It employs cross-dialect training and customized strategies for each specific dialect.

Result: The results demonstrate that tailored approaches for each dialect combined with cross-dialect training significantly improve speaker recognition performance across different Kurdish dialects.

Conclusion: Customized strategies for individual dialects along with cross-dialect training are essential for building robust speaker identification systems that can accurately identify speakers across multiple Kurdish dialects despite their phonetic and lexical differences.

Abstract: The complexity and difficulties of Kurdish speaker detection among its several dialects are investigated in this work. Because of its great phonetic and lexical differences, Kurdish with several dialects including Kurmanji, Sorani, and Hawrami offers special challenges for speaker recognition systems. The main difficulties in building a strong speaker identification system capable of precisely identifying speakers across several dialects are investigated in this work. To raise the accuracy and dependability of these systems, it also suggests solutions like sophisticated machine learning approaches, data augmentation tactics, and the building of thorough dialect-specific corpus. The results show that customized strategies for every dialect together with cross-dialect training greatly enhance recognition performance.

[1104] EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens

Joonyong Park, Kenichi Nakamura

Main category: eess.AS

TL;DR: EmoSSLSphere is a multilingual emotional TTS framework using spherical emotion vectors and SSL-based discrete tokens for fine-grained emotional control and cross-lingual emotion transfer.

DetailsMotivation: To enable fine-grained emotional control, effective cross-lingual emotion transfer, and robust speaker identity preservation in multilingual emotional TTS synthesis.

Method: Combines spherical emotion vectors (continuous spherical coordinate space) with discrete token features from self-supervised learning for semantic and acoustic modeling.

Result: Significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality on English and Japanese corpora. Outperforms baselines in naturalness and emotional expressiveness.

Conclusion: EmoSSLSphere shows potential as a scalable solution for multilingual emotional TTS with superior performance in emotional control and cross-lingual transfer.

Abstract: This paper introduces EmoSSLSphere, a novel framework for multilingual emotional text-to-speech (TTS) synthesis that combines spherical emotion vectors with discrete token features derived from self-supervised learning (SSL). By encoding emotions in a continuous spherical coordinate space and leveraging SSL-based representations for semantic and acoustic modeling, EmoSSLSphere enables fine-grained emotional control, effective cross-lingual emotion transfer, and robust preservation of speaker identity. We evaluate EmoSSLSphere on English and Japanese corpora, demonstrating significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality. Subjective evaluations further confirm that our method outperforms baseline models in terms of naturalness and emotional expressiveness, underscoring its potential as a scalable solution for multilingual emotional TTS.

eess.IV

[1105] Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events

Shuoyan Wei, Feng Li, Shengeng Tang, Runmin Cong, Yao Zhao, Meng Wang, Huihui Bai

Main category: eess.IV

TL;DR: EvEnhancer is a novel approach for continuous space-time video super-resolution that uses event streams to achieve robust and generalizable performance at arbitrary spatial and temporal scales, with EvEnhancerPlus adding a controllable switching mechanism for adaptive reconstruction.

DetailsMotivation: Existing continuous space-time video super-resolution methods often generalize poorly to out-of-distribution scales, producing unsatisfactory results when applied to arbitrary spatial and temporal resolutions.

Method: The approach combines event-adapted synthesis to capture long-term motion trajectories using spatiotemporal correlations between frames and events, with a local implicit video transformer that integrates cross-scale spatiotemporal attention. EvEnhancerPlus adds a controllable switching mechanism that dynamically determines reconstruction difficulty per pixel based on local event statistics, along with a cross-derivative training strategy for stable convergence.

Result: Extensive experiments show state-of-the-art performance on both synthetic and real-world datasets, with superior generalizability at out-of-distribution scales while substantially reducing computational overhead.

Conclusion: The proposed EvEnhancer framework successfully addresses the generalization limitations of existing C-STVSR methods by leveraging event streams and adaptive reconstruction mechanisms, achieving robust performance across arbitrary spatial and temporal scales.

Abstract: Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary spatial and temporal scales. However, prevailing methods often generalize poorly, producing unsatisfactory results when applied to out-of-distribution (OOD) scales. To overcome this limitation, we present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams to achieve robust and generalizable C-STVSR. Our approach incorporates event-adapted synthesis that capitalizes on the spatiotemporal correlations between frames and events to capture long-term motion trajectories, enabling adaptive interpolation and fusion across space and time. This is then coupled with a local implicit video transformer that integrates local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations and generate plausible videos at arbitrary resolutions and frame rates. We further develop EvEnhancerPlus, which builds a controllable switching mechanism that dynamically determines the reconstruction difficulty for each spatiotemporal pixel based on local event statistics. This allows the model to adaptively route reconstruction along the most suitable pathways at a fine-grained pixel level, substantially reducing computational overhead while maintaining excellent performance. Furthermore, we devise a cross-derivative training strategy that stabilizes the convergence of such a multi-pathway framework through staged cross-optimization. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining superior generalizability at OOD scales. The code is available at https://github.com/W-Shuoyan/EvEnhancerPlus.

[1106] Real-time nonlinear inversion of magnetic resonance elastography with operator learning

Juampablo E. Heras Rivera, Caitlin M. Neher, Mehmet Kurt

Main category: eess.IV

TL;DR: Developed an operator learning framework (oNLI) for real-time brain MRE inversion that achieves 30,000x speedup while maintaining spatial accuracy comparable to nonlinear inversion methods.

DetailsMotivation: To enable real-time inversion of brain magnetic resonance elastography (MRE) data while maintaining the spatial accuracy of nonlinear inversion methods, overcoming computational limitations of traditional approaches.

Method: Used a predictive deep operator learning framework trained with 10-fold cross-validation on 3D MRE data from 61 individuals, incorporating a structural prior mechanism similar to Soft Prior Regularization. Inputs were complex curl of displacement fields and outputs were NLI-derived reference elastograms.

Result: oNLI achieved significantly lower absolute percent error (8.4±0.5 for μ’ and 10.0±0.7 for μ’’) compared to CNN baselines (15.8±0.8 for μ’ and 26.1±1.1 for μ’’). Outperformed CNNs across all brain regions with statistical significance (p < 0.05) for both storage and loss moduli.

Conclusion: The oNLI framework enables real-time MRE inversion with 30,000x speedup while maintaining fine-grained spatial accuracy comparable to nonlinear inversion, outperforming CNN-based approaches.

Abstract: $\textbf{Purpose:}$ To develop and evaluate an operator learning framework for nonlinear inversion (NLI) of brain magnetic resonance elastography (MRE) data, which enables real-time inversion of elastograms with comparable spatial accuracy to NLI. $\textbf{Materials and Methods:}$ In this retrospective study, 3D MRE data from 61 individuals (mean age, 37.4 years; 34 female) were used for development of the framework. A predictive deep operator learning framework (oNLI) was trained using 10-fold cross-validation, with the complex curl of the measured displacement field as inputs and NLI-derived reference elastograms as outputs. A structural prior mechanism, analogous to Soft Prior Regularization in the MRE literature, was incorporated to improve spatial accuracy. Subject-level evaluation metrics included Pearson’s correlation coefficient, absolute relative error, and structural similarity index measure between predicted and reference elastograms across brain regions of different sizes to understand accuracy. Statistical analyses included paired t-tests comparing the proposed oNLI variants to the convolutional neural network baselines. $\textbf{Results:}$ Whole brain absolute percent error was 8.4 $\pm$ 0.5 ($\mu’$) and 10.0 $\pm$ 0.7 ($\mu’’$) for oNLI and 15.8 $\pm$ 0.8 ($\mu’$) and 26.1 $\pm$ 1.1 ($\mu’’$) for CNNs. Additionally, oNLI outperformed convolutional architectures as per Pearson’s correlation coefficient, $r$, in the whole brain and across all subregions for both the storage modulus and loss modulus (p < 0.05). $\textbf{Conclusion:}$ The oNLI framework enables real-time MRE inversion (30,000x speedup), outperforming CNN-based approaches and maintaining the fine-grained spatial accuracy achievable with NLI in the brain.

[1107] How We Won BraTS-SSA 2025: Brain Tumor Segmentation in the Sub-Saharan African Population Using Segmentation-Aware Data Augmentation and Model Ensembling

Claudia Takyi Ankomah, Livingstone Eli Ayivor, Ireneaus Nyame, Leslie Wambo, Patrick Yeboah Bonsu, Aondona Moses Iorumbur, Raymond Confidence, Toufiq Musah

Main category: eess.IV

TL;DR: This paper presents a brain tumor segmentation approach using data augmentation and model ensembling to improve performance on underrepresented datasets like BraTS-Africa.

DetailsMotivation: Brain tumors, especially gliomas, are challenging to diagnose due to complex growth patterns and individual brain variability. Existing deep learning models trained on homogeneous datasets lack robustness for underserved regions.

Method: Used segmentation-aware offline data augmentation on BraTS-Africa dataset and constructed an ensemble of three architectures: MedNeXt, SegMamba, and Residual-Encoder U-Net to leverage complementary strengths.

Result: MedNeXt trained for 1000 epochs achieved highest average lesion-wise dice (0.86) and normalized surface distance (0.81) scores. The ensemble model trained for 500 epochs produced the most balanced segmentation performance across tumor subregions.

Conclusion: Combination of advanced augmentation and model ensembling can improve segmentation accuracy and robustness on diverse and underrepresented datasets.

Abstract: Brain tumors, particularly gliomas, pose significant chall-enges due to their complex growth patterns, infiltrative nature, and the variability in brain structure across individuals, which makes accurate diagnosis and monitoring difficult. Deep learning models have been developed to accurately delineate these tumors. However, most of these models were trained on relatively homogenous high-resource datasets, limiting their robustness when deployed in underserved regions. In this study, we performed segmentation-aware offline data augmentation on the BraTS-Africa dataset to increase the data sample size and diversity to enhance generalization. We further constructed an ensemble of three distinct architectures, MedNeXt, SegMamba, and Residual-Encoder U-Net, to leverage their complementary strengths. Our best-performing model, MedNeXt, was trained on 1000 epochs and achieved the highest average lesion-wise dice and normalized surface distance scores of 0.86 and 0.81 respectively. However, the ensemble model trained for 500 epochs produced the most balanced segmentation performance across the tumour subregions. This work demonstrates that a combination of advanced augmentation and model ensembling can improve segmentation accuracy and robustness on diverse and underrepresented datasets. Code available at: https://github.com/SPARK-Academy-2025/SPARK-2025/tree/main/SPARK2025_BraTs_MODELS/SPARK_NeuroAshanti

[1108] ReTiDe: Real-Time Denoising for Energy-Efficient Motion Picture Processing with FPGAs

Changhong Li, Clément Bled, Rosa Fernandez, Shreejith Shanker

Main category: eess.IV

TL;DR: ReTiDe is a hardware-accelerated denoising system using FPGAs that achieves 37.71× GOPS throughput and 5.29× higher energy efficiency than prior FPGA denoisers, with minimal quality degradation.

DetailsMotivation: Current deep denoisers are computationally intensive and expensive to deploy on GPUs for real-time, high-resolution video streams, creating need for more efficient solutions.

Method: Uses a compact convolutional model quantized to INT8, compiled for AMD DPU-based FPGAs, with client-server integration that offloads computation from host CPU/GPU to networked FPGA service.

Result: Achieves 37.71× GOPS throughput and 5.29× higher energy efficiency compared to prior FPGA denoising accelerators, with negligible PSNR/SSIM degradation.

Conclusion: Specialized FPGA accelerators can provide practical, scalable denoising for encoding pipelines and post-production, reducing energy consumption without sacrificing quality or workflow compatibility.

Abstract: Denoising is a core operation in modern video pipelines. In codecs, in-loop filters suppress sensor noise and quantisation artefacts to improve rate-distortion performance; in cinema post-production, denoisers are used for restoration, grain management, and plate clean-up. However, state-of-the-art deep denoisers are computationally intensive and, at scale, are typically deployed on GPUs, incurring high power and cost for real-time, high-resolution streams. This paper presents Real-Time Denoise (ReTiDe), a hardware-accelerated denoising system that serves inference on data-centre Field Programmable Gate Arrays (FPGAs). A compact convolutional model is quantised (post-training quantisation plus quantisation-aware fine-tuning) to INT8 and compiled for AMD Deep Learning Processor Unit (DPU)-based FPGAs. A client-server integration offloads computation from the host CPU/GPU to a networked FPGA service, while remaining callable from existing workflows, e.g., NUKE, without disrupting artist tooling. On representative benchmarks, ReTiDe delivers 37.71$\times$ Giga Operations Per Second (GOPS) throughput and 5.29$\times$ higher energy efficiency than prior FPGA denoising accelerators, with negligible degradation in Peak Signal-to-Noise Ratio (PSNR)/Structural Similarity Index (SSIM). These results indicate that specialised accelerators can provide practical, scalable denoising for both encoding pipelines and post-production, reducing energy per frame without sacrificing quality or workflow compatibility. Code is available at https://github.com/RCSL-TCD/ReTiDe.

[1109] AI-Assisted Pleural Effusion Volume Estimation from Contrast-Enhanced CT Images

Sanhita Basu, Tomas Fröding, Ali Teymur Kahraman, Dimitris Toumpanakis, Tobias Sjöblom

Main category: eess.IV

TL;DR: Developed a semi-supervised deep learning framework (TTAS) for pleural effusion segmentation from CT scans, achieving superior performance over state-of-the-art models.

DetailsMotivation: Accurate measurement of pleural effusion volume from CT scans is challenging but important for clinical management.

Method: Used retrospective CTPA data with manual annotations for 100 cases. Developed Teacher-Teaching Assistant-Student (TTAS) semi-supervised framework for efficient training on non-segmented examinations.

Result: TTAS achieved mean Dice score of 0.82 vs 0.73 for nnU-Net (p<0.0001) and four-fold lower mean Absolute Volume Difference (6.49 mL vs 23.16 mL, p<0.0001).

Conclusion: The TTAS framework provides superior pleural effusion segmentation, enabling accurate volume determination from CT scans.

Abstract: Background: Pleural Effusions (PE) is a common finding in many different clinical conditions, but accurately measuring their volume from CT scans is challenging. Purpose: To improve PE segmentation and quantification for enhanced clinical management, we have developed and trained a semi-supervised deep learning framework on contrast-enhanced CT volumes. Materials and Methods: This retrospective study collected CT Pulmonary Angiogram (CTPA) data from internal and external datasets. A subset of 100 cases was manually annotated for model training, while the remaining cases were used for testing and validation. A novel semi-supervised deep learning framework, Teacher-Teaching Assistant-Student (TTAS), was developed and used to enable efficient training in non-segmented examinations. Segmentation performance was compared to that of state-of-the-art models. Results: 100 patients (mean age, 72 years, 28 [standard deviation]; 55 men) were included in the study. The TTAS model demonstrated superior segmentation performance compared to state-of-the-art models, achieving a mean Dice score of 0.82 (95% CI, 0.79 - 0.84) versus 0.73 for nnU-Net (p < 0.0001, Student’s T test). Additionally, TTAS exhibited a four-fold lower mean Absolute Volume Difference (AbVD) of 6.49 mL (95% CI, 4.80

  • 8.20) compared to nnU-Net’s AbVD of 23.16 mL (p < 0.0001). Conclusion: The developed TTAS framework offered superior PE segmentation, aiding accurate volume determination from CT scans.

[1110] Sliding Window Attention for Learned Video Compression

Alexander Kopte, André Kaup

Main category: eess.IV

TL;DR: 3D Sliding Window Attention (SWA) replaces patch-based local attention in video compression transformers, eliminating architectural flaws and computational redundancy while improving rate-distortion performance and reducing complexity.

DetailsMotivation: To address the architectural flaws and computational inefficiencies of patch-based local attention mechanisms in video compression transformers, particularly the irregular receptive fields and redundant overlapping windows in temporal autoregressive models.

Method: Introduces 3D Sliding Window Attention (SWA), a patchless form of local attention that enables a decoder-only architecture unifying spatial and temporal context processing with uniform receptive fields.

Result: Achieves up to 18.6% Bjøntegaard Delta-rate savings against VCT baseline, reduces decoder complexity by 2.8x, and makes entropy model 3.5x more efficient. Analysis shows benefits from long-range temporal context but performance degrades with excessive context.

Conclusion: 3D Sliding Window Attention provides a superior alternative to patch-based approaches, significantly improving video compression performance while reducing computational complexity through its patchless architecture and uniform receptive fields.

Abstract: To manage the complexity of transformers in video compression, local attention mechanisms are a practical necessity. The common approach of partitioning frames into patches, however, creates architectural flaws like irregular receptive fields. When adapted for temporal autoregressive models, this paradigm, exemplified by the Video Compression Transformer (VCT), also necessitates computationally redundant overlapping windows. This work introduces 3D Sliding Window Attention (SWA), a patchless form of local attention. By enabling a decoder-only architecture that unifies spatial and temporal context processing, and by providing a uniform receptive field, our method significantly improves rate-distortion performance, achieving Bj{\o}rntegaard Delta-rate savings of up to 18.6 % against the VCT baseline. Simultaneously, by eliminating the need for overlapping windows, our method reduces overall decoder complexity by a factor of 2.8, while its entropy model is nearly 3.5 times more efficient. We further analyze our model’s behavior and show that while it benefits from long-range temporal context, excessive context can degrade performance.

[1111] The method of the approximate inverse for limited-angle CT

Bernadette Hahn, Gael Rigaud, Richard Schmähl

Main category: eess.IV

TL;DR: A new model-driven approach called CLARK is proposed for limited-angle CT reconstruction, combining spectral filtering, approximate inverse method, and edge-preserving denoising to eliminate streak artifacts without requiring large datasets.

DetailsMotivation: Limited-angle CT enables faster data acquisition and safer medical scans but standard methods like FBP and total-variation produce artifacts. Deep learning methods require large datasets, creating a need for model-driven approaches that work with limited data.

Method: The method uses reconstruction kernels precomputed as solutions to auxiliary problems (LARK), then develops CLARK by combining spectral filtering, approximate inverse method, and custom edge-preserving denoising to handle ill-conditioning and stabilize the process.

Result: The approach successfully eliminates streak artifacts even for large limited angles, and handles the ill-conditioning inherited from the limited-angle Radon transform. Validation on synthetic and real data shows effectiveness.

Conclusion: CLARK provides a stable, model-driven solution for limited-angle CT reconstruction that avoids streak artifacts without requiring large datasets, serving as a promising foundation for future learning strategies.

Abstract: Limited-angle computerized tomography stands for one of the most difficult challenges in imaging. Although it opens the way to faster data acquisition in industry and less dangerous scans in medicine, standard approaches, such as the filtered backprojection (FBP) algorithm or the widely used total-variation functional, often produce various artefacts that hinder the diagnosis. With the rise of deep learning, many modern techniques have proven themselves successful in removing such artefacts but at the cost of large datasets. In this paper, we propose a new model-driven approach based on the method of the approximate inverse, which could serve as new starting point for learning strategies in the future. In contrast to FBP-type approaches, our reconstruction step consists in evaluating linear functionals on the measured data using reconstruction kernels that are precomputed as solution of an auxiliary problem. With this problem being uniquely solvable, the derived limited-angle reconstruction kernel (LARK) is able to fully reconstruct the object without the well-known streak artefacts, even for large limited angles. However, it inherits severe ill-conditioning which leads to a different kind of artefacts arising from the singular functions of the limited-angle Radon transform. The problem becomes particularly challenging when working on semi-discrete (real or analytical) measurements. We develop a general regularization strategy, named constrained limited-angle reconstruction kernel (CLARK), by combining spectral filter, the method of the approximate inverse and custom edge-preserving denoising in order to stabilize the whole process. We further derive and interpret error estimates for the application on real, i.e. semi-discrete, data and we validate our approach on synthetic and real data.

[1112] Adaptive double-phase Rudin–Osher–Fatemi denoising model

Wojciech Górny, Michał Łasica, Alexandros Matsoukas

Main category: eess.IV

TL;DR: A new image denoising model using variable-growth total variation regularization with adaptive weight to reduce staircasing while preserving edges.

DetailsMotivation: To address the staircasing effect in the classical Rudin-Osher-Fatemi model while maintaining edge preservation capabilities.

Method: Variable-growth total variation regularization of double-phase type with adaptive weight, implemented and tested on synthetic and natural images in 1D and 2D.

Result: The model was tested over a range of noise levels on both synthetic and natural images.

Conclusion: The proposed model effectively reduces staircasing while preserving edges similar to the classical approach.

Abstract: We propose a new image denoising model based on a variable-growth total variation regularization of double-phase type with adaptive weight. It is designed to reduce staircasing with respect to the classical Rudin–Osher–Fatemi model, while preserving the edges of the image in a similar fashion. We implement the model and test its performance on synthetic and natural images in 1D and 2D over a range of noise levels.

[1113] Robust MRI Reconstruction by Smoothed Unrolling (SMUG)

Shijun Liang, Van Hoang Minh Nguyen, Jinghan Jia, Ismail Alkhouri, Sijia Liu, Saiprasad Ravishankar

Main category: eess.IV

TL;DR: SMUG is a robust MRI reconstruction framework that combines deep unrolling with randomized smoothing to improve model stability against input perturbations, noise variations, and sampling rate changes.

DetailsMotivation: DL-based MRI reconstruction models are overly sensitive to minor input disturbances, leading to unstable and aliased images. There's a need for robust techniques that can handle train-test variations.

Method: Proposes Smoothed Unrolling (SMUG) framework that customizes randomized smoothing based on deep unrolling architecture, rather than applying it to the entire model. Uses randomized smoothing to improve tolerance against input noises.

Result: SMUG improves robustness against various instability sources including worst-case and random noise perturbations, varying measurement sampling rates, and different unrolling steps. Outperforms vanilla randomized smoothing approach.

Conclusion: SMUG provides an effective robust learning approach for MRI reconstruction that addresses sensitivity issues in DL models through customized randomized smoothing on unrolling architecture.

Abstract: As the popularity of deep learning (DL) in the field of magnetic resonance imaging (MRI) continues to rise, recent research has indicated that DL-based MRI reconstruction models might be excessively sensitive to minor input disturbances, including worst-case additive perturbations. This sensitivity often leads to unstable, aliased images. This raises the question of how to devise DL techniques for MRI reconstruction that can be robust to train-test variations. To address this problem, we propose a novel image reconstruction framework, termed Smoothed Unrolling (SMUG), which advances a deep unrolling-based MRI reconstruction model using a randomized smoothing (RS)-based robust learning approach. RS, which improves the tolerance of a model against input noises, has been widely used in the design of adversarial defense approaches for image classification tasks. Yet, we find that the conventional design that applies RS to the entire DL-based MRI model is ineffective. In this paper, we show that SMUG and its variants address the above issue by customizing the RS process based on the unrolling architecture of a DL-based MRI reconstruction model. Compared to the vanilla RS approach, we show that SMUG improves the robustness of MRI reconstruction with respect to a diverse set of instability sources, including worst-case and random noise perturbations to input measurements, varying measurement sampling rates, and different numbers of unrolling steps. Furthermore, we theoretically analyze the robustness of our method in the presence of perturbations.

[1114] ResSR: A Computationally Efficient Residual Approach to Super-Resolving Multispectral Images

Haley Duba-Sullivan, Emma J. Reid, Sophie Voisin, Charles A. Bouman, Gregery T. Buzzard

Main category: eess.IV

TL;DR: ResSR is a computationally efficient multispectral image super-resolution method that uses spectral decomposition and spatial residual correction to upsample low-resolution bands without expensive iterative deconvolution or training-based approaches.

DetailsMotivation: Existing multispectral image super-resolution methods are computationally expensive due to spatially regularized deconvolution and training-based approaches, limiting practical applications.

Method: Uses singular value decomposition for spectral correlation identification, pixel-wise computation for upsampling, and residual correction for high-spatial frequency component correction. Formulated as spatially-coupled optimization but solved with pixel-wise regularization for non-iterative efficiency.

Result: ResSR is 2× to 10× faster than alternative MSI-SR algorithms while producing comparable or better image quality on both simulated and measured data.

Conclusion: ResSR provides an efficient non-iterative solution for multispectral image super-resolution that achieves high-quality reconstructions with significantly reduced computational cost compared to existing methods.

Abstract: Multispectral imaging sensors typically have wavelength-dependent resolution, which limits downstream processing. Consequently, researchers have proposed multispectral image super-resolution (MSI-SR) methods which upsample low-resolution bands to achieve a common resolution across all wavelengths. However, existing MSI-SR methods are computationally expensive because they require spatially regularized deconvolution and/or training-based methods. In this paper, we introduce ResSR, a computationally efficient MSI-SR method that achieves high-quality reconstructions by using spectral decomposition along with spatial residual correction. ResSR applies singular value decomposition to identify correlations across spectral bands, uses pixel-wise computation to upsample the MSI, and then applies a residual correction process to correct the high-spatial frequency components of the upsampled bands. While ResSR is formulated as the solution to a spatially-coupled optimization problem, we use pixel-wise regularization and derive an approximate non-iterative solution, resulting in a computationally efficient, non-iterative algorithm. Results on a combination of simulated and measured data show that ResSR is 2$\times$ to 10$\times$ faster than alternative MSI-SR algorithms, while producing comparable or better image quality. Code is available at https://github.com/hdsullivan/ResSR.

[1115] Segmenting Bi-Atrial Structures Using ResNext Based Framework

Malitha Gunawardhana, Mark L Trew, Gregory B Sands, Jichao Zhao

Main category: eess.IV

TL;DR: TASSNet is a two-stage deep learning framework for automated segmentation of both left and right atria from 3D LGE-MRI, addressing challenges in atrial fibrillation management.

DetailsMotivation: Manual segmentation of atrial structures from LGE-MRI is time-consuming, operator-dependent, and prone to variability, creating a need for automated solutions to guide ablation strategies in persistent AF.

Method: TASSNet uses a ResNeXt-based encoder for enhanced feature extraction and a cyclical learning rate schedule to handle convergence instability in imbalanced 3D segmentation tasks.

Result: TASSNet successfully segmented atrial structures with high accuracy on both in-distribution and out-of-distribution datasets without additional training.

Conclusion: TASSNet shows potential for robust and reproducible bi-atrial segmentation, enabling advanced fibrosis quantification and personalized ablation planning in clinical AF management.

Abstract: Atrial Fibrillation (AF), the most common sustained cardiac arrhythmia worldwide, increasingly requires accurate bi-atrial structural assessment to guide ablation strategies, particularly in persistent AF. Late gadolinium-enhanced magnetic resonance imaging (LGE-MRI) enables visualisation of atrial fibrosis, but precise manual segmentation remains time-consuming, operator-dependent, and prone to variability. We propose TASSNet, a novel two-stage deep learning framework for fully automated segmentation of both left atrium (LA) and right atrium (RA), including atrial walls and cavities, from 3D LGE-MRI. TASSNet introduces two main innovations: (i) a ResNeXt-based encoder to enhance feature extraction from limited medical datasets, and (ii) a cyclical learning rate schedule to address convergence instability in highly imbalanced, small-batch 3D segmentation tasks. We evaluated our method on two datasets, one of which was completely out-of-distribution, without any additional training. In both cases, TASSNet successfully segmented atrial structures with high accuracy. These results highlight TASSNet’s potential for robust and reproducible bi-atrial segmentation, enabling advanced fibrosis quantification and personalised ablation planning in clinical AF management.

[1116] Conformalized Generative Bayesian Imaging: An Uncertainty Quantification Framework for Computational Imaging

Canberk Ekmekci, Mujdat Cetin

Main category: eess.IV

TL;DR: A scalable framework for jointly quantifying both aleatoric (inherent) and epistemic (model) uncertainties in computational imaging, combining generative models with Bayesian neural networks and conformal prediction for calibrated uncertainty estimates.

DetailsMotivation: Current methods only quantify either aleatoric uncertainty (using generative models) or epistemic uncertainty (using Bayesian neural networks) separately, but there's a need for joint quantification of both uncertainty types in trustworthy computational imaging.

Method: Proposes a framework that accepts existing generative model-based posterior sampling methods and adds epistemic uncertainty quantification through Bayesian neural networks with latent variables and deep ensembling, with conformal prediction for calibration.

Result: Evaluated on MRI, CT, and image inpainting, showing that the framework produces uncertainty estimates with characteristic features of true epistemic and aleatoric uncertainties, and conformal prediction enables marginal coverage guarantees.

Conclusion: The framework successfully quantifies both uncertainty types jointly, provides calibrated uncertainty estimates through conformal prediction, and demonstrates practical utility across multiple imaging modalities.

Abstract: Uncertainty quantification plays an important role in achieving trustworthy and reliable learning-based computational imaging. Recent advances in generative modeling and Bayesian neural networks have enabled the development of uncertainty-aware image reconstruction methods. Current generative model-based methods seek to quantify the inherent (aleatoric) uncertainty on the underlying image for given measurements by learning to sample from the posterior distribution of the underlying image. On the other hand, Bayesian neural network-based approaches aim to quantify the model (epistemic) uncertainty on the parameters of a deep neural network-based reconstruction method by approximating the posterior distribution of those parameters. Unfortunately, an ongoing need for an inversion method that can jointly quantify complex aleatoric uncertainty and epistemic uncertainty patterns still persists. In this paper, we present a scalable framework that can quantify both aleatoric and epistemic uncertainties. The proposed framework accepts an existing generative model-based posterior sampling method as an input and introduces an epistemic uncertainty quantification capability through Bayesian neural networks with latent variables and deep ensembling. Furthermore, by leveraging the conformal prediction methodology, the proposed framework can be easily calibrated to ensure rigorous uncertainty quantification. We evaluated the proposed framework on magnetic resonance imaging, computed tomography, and image inpainting problems and showed that the epistemic and aleatoric uncertainty estimates produced by the proposed framework display the characteristic features of true epistemic and aleatoric uncertainties. Furthermore, our results demonstrated that the use of conformal prediction on top of the proposed framework enables marginal coverage guarantees consistent with frequentist principles.

[1117] Filling of incomplete sinograms from sparse PET detector configurations using a residual U-Net

Klara Leffler, Luigi Tommaso Luppino, Samuel Kuttner, Karin Söderkvist, Jan Axelsson

Main category: eess.IV

TL;DR: A deep learning approach using a modified Residual U-Net to restore missing sinogram data in sparse PET scanner configurations, enabling cost-effective total body PET imaging while maintaining acceptable image quality.

DetailsMotivation: To address the high cost of densely packed photodetectors in long axial field-of-view PET scanners by developing sparse system configurations with deep learning restoration, making total body PET more clinically accessible.

Method: Modified Residual U-Net trained on clinical PET scans from GE Signa PET/MR, simulating removal of 50% detectors in chessboard pattern (retaining only 25% lines of response) to restore missing sinogram data.

Result: Successfully recovered missing counts with mean absolute error below two events per pixel, outperforming 2D interpolation in both sinogram and reconstructed image domains, though reconstructed images lacked sharpness in finer details.

Conclusion: Sparse detector configurations combined with deep learning offer a viable alternative to conventional PET designs, supporting development of cost-effective total body PET scanners as a significant advancement in medical imaging.

Abstract: Long axial field-of-view PET scanners offer increased field-of-view and sensitivity compared to traditional PET scanners. However, a significant cost is associated with the densely packed photodetectors required for the extended-coverage systems, limiting clinical utilisation. To mitigate the cost limitations, alternative sparse system configurations have been proposed, allowing an extended field-of-view PET design with detector costs similar to a standard PET system, albeit at the expense of image quality. In this work, we propose a deep sinogram restoration network to fill in the missing sinogram data. Our method utilises a modified Residual U-Net, trained on clinical PET scans from a GE Signa PET/MR, simulating the removal of 50% of the detectors in a chessboard pattern (retaining only 25% of all lines of response). The model successfully recovers missing counts, with a mean absolute error below two events per pixel, outperforming 2D interpolation in both sinogram and reconstructed image domain. Notably, the predicted sinograms exhibit a smoothing effect, leading to reconstructed images lacking sharpness in finer details. Despite these limitations, the model demonstrates a substantial capacity for compensating for the undersampling caused by the sparse detector configuration. This proof-of-concept study suggests that sparse detector configurations, combined with deep learning techniques, offer a viable alternative to conventional PET scanner designs. This approach supports the development of cost-effective, total body PET scanners, allowing a significant step forward in medical imaging technology.

[1118] RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis

Songxiao Yang, Haolin Wang, Yao Fu, Ye Tian, Tamotsu Kamishima, Masayuki Ikebe, Yafei Ou, Masatoshi Okutomi

Main category: eess.IV

TL;DR: This paper introduces the first public multi-task dataset for wrist bone analysis in conventional radiography, focusing on rheumatoid arthritis diagnosis with instance segmentation and bone erosion scoring.

DetailsMotivation: Limited CAD research for wrist RA due to challenges in acquiring high-quality annotations: complex wrist anatomy with small bones and narrow joints, and disease progression changes that require rheumatology expertise.

Method: Created a dataset of 1048 wrist radiographs from 388 patients across 6 medical centers, with pixel-level instance segmentation annotations for 618 images and SvdH bone erosion scores for 800 images.

Result: Established the first public resource for wrist bone instance segmentation, providing a foundation for various RA research tasks including joint space narrowing quantification, bone erosion detection, and bone deformity evaluation.

Conclusion: This dataset will significantly lower barriers to wrist RA research and accelerate progress in computer-aided diagnosis for rheumatoid arthritis and other wrist-related medical tasks.

Abstract: Rheumatoid arthritis (RA) is a common autoimmune disease that has been the focus of research in computer-aided diagnosis (CAD) and disease monitoring. In clinical settings, conventional radiography (CR) is widely used for the screening and evaluation of RA due to its low cost and accessibility. The wrist is a critical region for the diagnosis of RA. However, CAD research in this area remains limited, primarily due to the challenges in acquiring high-quality instance-level annotations. (i) The wrist comprises numerous small bones with narrow joint spaces, complex structures, and frequent overlaps, requiring detailed anatomical knowledge for accurate annotation. (ii) Disease progression in RA often leads to osteophyte, bone erosion (BE), and even bony ankylosis, which alter bone morphology and increase annotation difficulty, necessitating expertise in rheumatology. This work presents a multi-task dataset for wrist bone in CR, including two tasks: (i) wrist bone instance segmentation and (ii) Sharp/van der Heijde (SvdH) BE scoring, which is the first public resource for wrist bone instance segmentation. This dataset comprises 1048 wrist conventional radiographs of 388 patients from six medical centers, with pixel-level instance segmentation annotations for 618 images and SvdH BE scores for 800 images. This dataset can potentially support a wide range of research tasks related to RA, including joint space narrowing (JSN) progression quantification, BE detection, bone deformity evaluation, and osteophyte detection. It may also be applied to other wrist-related tasks, such as carpal bone fracture localization. We hope this dataset will significantly lower the barrier to research on wrist RA and accelerate progress in CAD research within the RA-related domain.

[1119] Depth-Sequence Transformer (DST) for Segment-Specific ICA Calcification Mapping on Non-Contrast CT

Xiangjian Hou, Ebru Yaman Akcicek, Xin Wang, Kazem Hashemizadeh, Scott Mcnally, Chun Yuan, Xiaodong Ma

Main category: eess.IV

TL;DR: The paper introduces a Depth-Sequence Transformer (DST) framework that reformulates 3D carotid artery calcification analysis as parallel probabilistic landmark localization along the 1D axial dimension, achieving high accuracy in segment-specific quantification.

DetailsMotivation: Total intracranial carotid artery calcification volume ignores the critical influence of plaque location, and conventional 3D models sacrifice global context when processing downsampled volumes or isolated patches, making segment-specific quantification technically infeasible.

Method: Proposed Depth-Sequence Transformer (DST) framework processes full-resolution CT volumes as sequences of 2D slices, learning to predict 6 independent probability distributions that pinpoint key anatomical landmarks through parallel probabilistic landmark localization.

Result: Achieved Mean Absolute Error of 0.1 slices with 96% of predictions within ±1 slice tolerance on 100-patient clinical cohort using 5-fold cross-validation. Also established best result on Clean-CC-CCII classification benchmark.

Conclusion: The work delivers the first practical tool for automated segment-specific ICAC analysis and provides a foundation for studying location-specific biomarkers in diagnosis, prognosis, and procedural planning.

Abstract: While total intracranial carotid artery calcification (ICAC) volume is an established stroke biomarker, growing evidence shows this aggregate metric ignores the critical influence of plaque location, since calcification in different segments carries distinct prognostic and procedural risks. However, a finer-grained, segment-specific quantification has remained technically infeasible. Conventional 3D models are forced to process downsampled volumes or isolated patches, sacrificing the global context required to resolve anatomical ambiguity and render reliable landmark localization. To overcome this, we reformulate the 3D challenge as a \textbf{Parallel Probabilistic Landmark Localization} task along the 1D axial dimension. We propose the \textbf{Depth-Sequence Transformer (DST)}, a framework that processes full-resolution CT volumes as sequences of 2D slices, learning to predict $N=6$ independent probability distributions that pinpoint key anatomical landmarks. Our DST framework demonstrates exceptional accuracy and robustness. Evaluated on a 100-patient clinical cohort with rigorous 5-fold cross-validation, it achieves a Mean Absolute Error (MAE) of \textbf{0.1 slices}, with \textbf{96%} of predictions falling within a $\pm1$ slice tolerance. Furthermore, to validate its architectural power, the DST backbone establishes the best result on the public Clean-CC-CCII classification benchmark under an end-to-end evaluation protocol. Our work delivers the first practical tool for automated segment-specific ICAC analysis. The proposed framework provides a foundation for further studies on the role of location-specific biomarkers in diagnosis, prognosis, and procedural planning.

[1120] Multisession Longitudinal Dynamic MRI Incorporating Patient-Specific Prior Image Information Across Time

Jingjia Chen, Hersh Chandarana, Daniel K. Sodickson, Li Feng

Main category: eess.IV

TL;DR: Proposes longitudinal dynamic MRI that uses patient-specific prior images across multiple sessions to improve reconstruction quality and accelerate data acquisition.

DetailsMotivation: Existing MRI reconstruction methods process each session independently, missing the opportunity to leverage shared anatomical and motion information across longitudinal imaging sessions.

Method: Concatenates multi-session time-resolved 4D GRASP datasets into extended dynamic series and applies low-rank subspace-based reconstruction algorithm.

Result: Longitudinal 4D GRASP reconstruction consistently outperforms standard single-session reconstruction in image quality while preserving inter-session variations, and demonstrates robustness to anatomical changes.

Conclusion: The work introduces a new context-aware imaging paradigm where repeated patient imaging enables faster subsequent scans, improving efficiency and consistency in longitudinal MRI.

Abstract: Serial Magnetic Resonance Imaging (MRI) exams are often performed in clinical practice, offering shared anatomical and motion information across imaging sessions. However, existing reconstruction methods process each session independently without leveraging this valuable longitudinal information. In this work, we propose a novel concept of longitudinal dynamic MRI, which incorporates patient-specific prior images to exploit temporal correlations across sessions. This framework enables progressive acceleration of data acquisition and reduction of scan time as more imaging sessions become available. The concept is demonstrated using the 4D Golden-angle RAdial Sparse Parallel (GRASP) MRI, a state-of-the-art dynamic imaging technique. Longitudinal reconstruction is performed by concatenating multi-session time-resolved 4D GRASP datasets into an extended dynamic series, followed by a low-rank subspace-based reconstruction algorithm. A series of experiments were conducted to evaluate the feasibility and performance of the proposed method. Results show that longitudinal 4D GRASP reconstruction consistently outperforms standard single-session reconstruction in image quality, while preserving inter-session variations. The approach demonstrated robustness to changes in anatomy, imaging intervals, and body contour, highlighting its potential for improving imaging efficiency and consistency in longitudinal MRI applications. More generally, this work suggests a new context-aware imaging paradigm in which the more we see a patient, the faster we can image.

[1121] EMedNeXt: An Enhanced Brain Tumor Segmentation Framework for Sub-Saharan Africa using MedNeXt V2 with Deep Supervision

Ahmed Jaheen, Abdelrahman Elsayed, Damir Kim, Daniil Tikhonov, Matheus Scatolin, Mohor Banerjee, Qiankun Ji, Mostafa Salem, Hu Wang, Sarim Hashmi, Mohammad Yaqub

Main category: eess.IV

TL;DR: EMedNeXt is an enhanced brain tumor segmentation framework designed for sub-Saharan Africa that addresses challenges of low-quality MRI scanners and scarce radiology expertise through improved architecture and robust ensembling.

DetailsMotivation: Manual MRI segmentation for brain tumor diagnosis is time-consuming, requires expert radiologists, and is infeasible in under-resourced healthcare systems, especially in low-income regions with poor MRI quality and limited radiology expertise.

Method: Enhanced MedNeXt V2 framework with deep supervision and optimized post-processing, featuring larger region of interest, improved nnU-Net v2-based architecture, and robust model ensembling system tailored for sub-Saharan Africa.

Result: Achieved average LesionWise DSC of 0.897 with LesionWise NSD of 0.541 and 0.84 at tolerances of 0.5 mm and 1.0 mm respectively on hidden validation set.

Conclusion: EMedNeXt provides an effective solution for robust brain tumor segmentation in resource-constrained settings like sub-Saharan Africa, addressing image quality degradation and limited expertise challenges.

Abstract: Brain cancer affects millions worldwide, and in nearly every clinical setting, doctors rely on magnetic resonance imaging (MRI) to diagnose and monitor gliomas. However, the current standard for tumor quantification through manual segmentation of multi-parametric MRI is time-consuming, requires expert radiologists, and is often infeasible in under-resourced healthcare systems. This problem is especially pronounced in low-income regions, where MRI scanners are of lower quality and radiology expertise is scarce, leading to incorrect segmentation and quantification. In addition, the number of acquired MRI scans in Africa is typically small. To address these challenges, the BraTS-Lighthouse 2025 Challenge focuses on robust tumor segmentation in sub-Saharan Africa (SSA), where resource constraints and image quality degradation introduce significant shifts. In this study, we present EMedNeXt – an enhanced brain tumor segmentation framework based on MedNeXt V2 with deep supervision and optimized post-processing pipelines tailored for SSA. EMedNeXt introduces three key contributions: a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a robust model ensembling system. Evaluated on the hidden validation set, our solution achieved an average LesionWise DSC of 0.897 with an average LesionWise NSD of 0.541 and 0.84 at a tolerance of 0.5 mm and 1.0 mm, respectively.

Last updated: 2025-10-13
Built with Hugo, theme modified on Stack